CONCEPT DIABETES: Analytical pipeline
Introduction
CONCEPT-DIABETES
CONCEPT-DIABETES is part of CONCEPT, a coordinated research project initiative under the umbrella of REDISSEC, the Spanish Research Network on Health Services Research and Chronic Conditions] [www.redissec.com], that aims at analyzing chronic care effectiveness and efficiency in a number of cohorts built on real world data (RWD). In the specific case of CONCEPT-DIABETES, the focus will be on assessing the effectiveness of a set of clinical practice actions and quality improvement strategies at different levels (patient-level, health-care provider level and health system level) on the management and health results of patients with type 2 diabetes (T2D) using process mining methodology.
It is a population-based retrospective observational study centered on all T2D patients diagnosed in four Regional Health Services within the Spanish National Health Service, that includes information from all their contacts with the health services using the electronic medical record systems including Primary Care data, Specialist Care data, Hospitalizations, Urgent Care data, Pharmacy Claims, and also other registers such as the mortality and the population register. We will assess to what extent recommended interventions from evidence-based guidelines are implemented in real life and which are their effects on health outcomes. Process mining methods will be used to analyze the data, and comparison with standard methods will be also conducted.
Cohort
The cohort is defined as patients with type 2 diabetes:
Inclusion criteria: Patients that, at 2017-01-01 or during the follow-up from 2017-01-01 to 2022-12-31, had active health card (active TIA - tarjeta sanitaria activa) and one of the inclusion codes given in the ‘inclusion code list (’T90’ if CIAP-2, ‘250’ if ‘CIE-9CM’ or ‘E11’ if CIE-10-ES).
Exclusion criteria: Patients that had none of the exclusion codes given in the exclusion codes list (‘T89’ if CIAP-2, ‘250.01’ if CIE-9CM, ‘250.03’ if CIE-9CM, ‘250.11’ if CIE-9CM, ‘250.13’ if CIE-9CM, ‘250.21’ if CIE-9CM, ‘250.23’ if CIE-9CM, ‘250.31’ if CIE-9CM, ‘250.33’ if CIE-9CM, ‘250.41’ if CIE-9CM, ‘250.43’ if CIE-9CM, ‘250.51’ if CIE-9CM, ‘250.53’ if CIE-9CM, ‘250.61’ if CIE-9CM, ‘250.63’ if CIE-9CM, ‘250.71’ if CIE-9CM, ‘250.73’ if CIE-9CM, ‘250.81’ if CIE-9CM, ‘250.83’ if CIE-9CM, ‘250.91’ if CIE-9CM, ‘250.93’ if CIE-9CM or ‘E10’ if CIE-10-ES) during their follow-up or patients with no contact with the health system from 2017-01-01 to 2022-12-31.
Study period: 2017-01-01 until 2022-12-31.
Treatment guidelines
One of the main intermediate outcome indicators to which clinical guidelines pay special attention is a good glycaemic control, since its absence is clearly related to micro and macrovascular complications. In clinical practice, suboptimal glycaemic control can be mainly attributed to two main reasons: the patients’ non-adherence to prescribed treatment; and the healthcare providers’ clinical or therapeutic guidelines’ non-adherence.
Treatment decisions are based on glycated hemoglobin measurements. In this context the redGDPS foundation provides DM2 treatment algorithm, a diagram that aims to help professionals to quickly and easily choose the most appropriate treatment for people with diabetes.
Process indicators
Another measures to which different studies pay special attention are process indicators. First, process indicators are essential for assessing the three dimensions of healthcare quality: effectiveness, safety, and patient-centeredness. They are measured relatively easily and often do not require risk-adjustment, making their interpretation straightforward. Poor performance on process indicators can be directly attributed to the actions of healthcare providers, providing a clear indication for improvement, such as better adherence to clinical guidelines. Moreover, process indicators allow for the identification of areas for quality improvement, which is crucial in the complex healthcare environment. These indicators cover a wide range of health factors and help focus resources and efforts to improve the health and well-being of the population. Therefore, the attention given to health process indicators is justified by their crucial role in evaluating and improving healthcare quality and population health.
Running code
The python libraries used are:
Code
import sys
import pm4py, subprocess
import pandas as pd
import numpy as np
import itertools
import matplotlib.pyplot as plt
import textdistance
import gensim.corpora as corpora
from tqdm import trange, tqdm
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import fcluster, linkage, dendrogram
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer
from yellowbrick.cluster import KElbowVisualizer
from pm4py.objects.petri_net.obj import PetriNet, Marking
from pm4py.visualization.decisiontree import visualizer as tree_visualizer
from pm4py.algo.decision_mining import algorithm as decision_mining
from pm4py.visualization.petri_net import visualizer
from gensim.models import LdaModel
from datetime import datetime, timedelta
from dateutil.relativedelta import relativedelta
import duckdb
import re
import loggingThe R libraries used are:
Code
library(tidyverse)
library(lubridate)
library(jsonlite)
library(ggplot2)
library(bupaR)
library(processmapR)
library(dplyr)
library(DiagrammeR)
library(DiagrammeRsvg)
library(rsvg)
library(here)
library(survival)
library(lcmm)
library(coxme)
library(muhaz)
library(ggfortify)
library(bayestestR)
library(purrr)
library(duckdb)
library(logger)
library(finalfit)
library(flextable)
library(knitr)Data preprocessing
First of all, it is important to prepare correctly the data for the analysis. The next function creates some sql views that are going to be useful later:
Code
def general_preprocess(con):
'''
Generating views
Args:
con : db connector variable
'''
con.sql("DROP VIEW IF EXISTS cmbd_incidents_view;")
con.sql("DROP VIEW IF EXISTS comorb_incidents_view;")
con.sql("DROP VIEW IF EXISTS ss_use_incidents_view;")
con.sql("DROP VIEW IF EXISTS param_incidents_view_;")
con.sql("DROP VIEW IF EXISTS param_incidents_view;")
con.sql("DROP VIEW IF EXISTS param_cat_incidents_view;")
con.sql("DROP VIEW IF EXISTS patient_incidents_view;")
con.sql("DROP VIEW IF EXISTS treat_incidents_view;")
con.sql("DROP VIEW IF EXISTS exams_incidents_view;")
con.sql("CREATE VIEW cmbd_incidents_view AS SELECT * FROM main.cmbd \
WHERE patient_id IN (SELECT patient_id FROM main.patient \
WHERE dx_date >='2017-01-01')")
con.sql("CREATE VIEW ss_use_incidents_view AS SELECT * FROM main.ss_use \
WHERE patient_id IN (SELECT patient_id FROM main.patient \
WHERE dx_date >='2017-01-01')")
con.sql("CREATE VIEW comorb_incidents_view AS SELECT * FROM main.comorb \
WHERE patient_id IN (SELECT patient_id FROM main.patient \
WHERE dx_date >='2017-01-01')")
con.sql("CREATE VIEW param_incidents_view_ AS SELECT DISTINCT * FROM main.param \
WHERE patient_id IN (SELECT patient_id FROM main.patient \
WHERE dx_date >='2017-01-01') \
AND (param_value!=0 OR param_name!='bmi')")
con.sql("CREATE VIEW param_incidents_view AS \
SELECT DISTINCT patient_id,param_name,MAX(param_value) AS param_value,param_date\
FROM param_incidents_view_ GROUP BY patient_id,param_name,param_date")
con.sql("CREATE VIEW param_cat_incidents_view AS SELECT DISTINCT * FROM main.param_cat \
WHERE patient_id IN (SELECT patient_id FROM main.patient \
WHERE dx_date >='2017-01-01')")
con.sql("CREATE VIEW patient_incidents_view AS SELECT *,\
FLOOR(DATEDIFF('day',month_nac,DATE '2017-01-01') / 365.25) \
AS 'age', \
FLOOR(DATEDIFF('day',month_nac, dx_date) / 365.25) \
AS 'age_dx', \
DATE_ADD(month_nac, INTERVAL 75 YEAR) AS 'turn_to_75', \
DATE_ADD(month_nac, INTERVAL 65 YEAR) AS 'turn_to_65' \
FROM main.patient WHERE patient_id IN (SELECT patient_id \
FROM main.patient WHERE dx_date >='2017-01-01')")
con.sql("CREATE VIEW treat_incidents_view AS SELECT * FROM main.treat \
WHERE patient_id IN (SELECT patient_id FROM main.patient \
WHERE dx_date >='2017-01-01')")
con.sql("CREATE VIEW exams_incidents_view AS SELECT * FROM main.exams \
WHERE patient_id IN (SELECT patient_id FROM main.patient \
WHERE dx_date >='2017-01-01')")
con.sql("DROP VIEW IF EXISTS param_incidents_view_age_date;")
con.sql("CREATE VIEW param_incidents_view_age_date AS SELECT param_incidents_view.patient_id, \
param_name, param_value, param_date, turn_to_65, turn_to_75 FROM \
param_incidents_view LEFT JOIN patient_incidents_view ON \
param_incidents_view.patient_id = patient_incidents_view.patient_id")
con.sql("DROP VIEW IF EXISTS cmbd_incidents_postdx_view")
con.sql("CREATE VIEW cmbd_incidents_postdx_view AS \
SELECT * FROM cmbd_incidents_view \
WHERE admission_date > (SELECT dx_date \
FROM patient_incidents_view \
WHERE patient_incidents_view.patient_id = cmbd_incidents_view.patient_id)")
con.sql("DROP VIEW IF EXISTS cmbd_incidents_postdx_first_view")
con.sql("CREATE VIEW cmbd_incidents_postdx_first_view AS \
SELECT patient_id, admission_date, diagnosis1_code FROM (SELECT *,\
ROW_NUMBER() OVER(PARTITION BY patient_id \
ORDER BY admission_date) AS rn \
FROM cmbd_incidents_postdx_view) t WHERE rn = 1;")
con.sql("DROP VIEW IF EXISTS param_incidents_predx_view;")
con.sql("CREATE VIEW param_incidents_predx_view AS SELECT * FROM (\
SELECT * FROM (SELECT patient_id, param_name, param_value, param_date \
FROM param_incidents_view) sub \
WHERE param_date <= (SELECT dx_date FROM patient_incidents_view \
WHERE patient_incidents_view.patient_id = sub.patient_id));")
con.sql("DROP VIEW IF EXISTS param_incidents_predx_last_view;")
con.sql("CREATE VIEW param_incidents_predx_last_view AS \
SELECT patient_id, param_name, param_date, param_value \
FROM (SELECT patient_id, param_name, param_date, param_value,\
ROW_NUMBER() OVER(PARTITION BY patient_id, param_name \
ORDER BY param_date DESC) AS rn \
FROM param_incidents_predx_view \
WHERE param_value!=9999) t WHERE rn = 1;")
con.sql("DROP VIEW IF EXISTS param_cat_incidents_predx_view;")
con.sql("CREATE VIEW param_cat_incidents_predx_view AS SELECT * FROM (\
SELECT * FROM (SELECT patient_id, param_cat_name, param_cat_value, param_cat_date \
FROM param_cat_incidents_view) sub \
WHERE param_cat_date <= (SELECT dx_date FROM patient_incidents_view \
WHERE patient_incidents_view.patient_id = sub.patient_id));")
con.sql("DROP VIEW IF EXISTS param_cat_incidents_predx_last_view;")
con.sql("CREATE VIEW param_cat_incidents_predx_last_view AS \
SELECT patient_id, param_cat_name, param_cat_date, param_cat_value \
FROM (SELECT patient_id, param_cat_name, param_cat_date, param_cat_value,\
ROW_NUMBER() OVER(PARTITION BY patient_id, param_cat_name \
ORDER BY param_cat_date DESC) AS rn \
FROM param_cat_incidents_predx_view) t WHERE rn = 1;")
con.sql("DROP VIEW IF EXISTS param_prevalents_pre_view;")
con.sql("CREATE VIEW param_prevalents_pre_view AS \
SELECT DISTINCT patient_id, param_name, param_value, param_date FROM main.param \
WHERE param_date < '2017-01-01' AND patient_id IN (\
SELECT patient_id FROM main.patient WHERE dx_date < '2017-01-01')")
con.sql("DROP VIEW IF EXISTS param_prevalents_pre_last_view;")
con.sql("CREATE VIEW param_prevalents_pre_last_view AS \
SELECT patient_id, param_name, param_date, param_value \
FROM (SELECT patient_id, param_name, param_date, param_value,\
ROW_NUMBER() OVER(PARTITION BY patient_id, param_name \
ORDER BY param_date DESC) AS rn \
FROM param_prevalents_pre_view \
WHERE param_value!=9999) t WHERE rn = 1;")
con.sql("DROP VIEW IF EXISTS param_cat_prevalents_pre_view;")
con.sql("CREATE VIEW param_cat_prevalents_pre_view AS \
SELECT DISTINCT * FROM main.param_cat \
WHERE param_cat_date < '2017-01-01' AND patient_id IN (\
SELECT patient_id FROM main.patient WHERE dx_date < '2017-01-01')")
con.sql("DROP VIEW IF EXISTS param_cat_prevalents_pre_last_view;")
con.sql("CREATE VIEW param_cat_prevalents_pre_last_view AS \
SELECT patient_id, param_cat_name, param_cat_date, param_cat_value \
FROM (SELECT patient_id, param_cat_name, param_cat_date, param_cat_value,\
ROW_NUMBER() OVER(PARTITION BY patient_id, param_cat_name \
ORDER BY param_cat_date DESC) AS rn \
FROM param_cat_prevalents_pre_view) t WHERE rn = 1;") Treatments’ event log creation
For using process mining an event log is needed. The sort of functions below take data model’s treatment and parameter tables to create an event log of glycated hemoglobin measures and treatment of patients. Glycated hemoglobin measurement events are divided into two different states, those that have a value smaller than “L” and the others. L’s value depends on patient’s diabetes duration, age and comorbilities and complications. Therefore with the help of expert clinicians we decided that L would take the following values:
- L=7 if age<75 and no cardiovascular disease, heart failure, chronic kidney disease or fragility.
- L=8 if age<65 and any cardiovascular disease, heart failure, chronic kidney disease or fragility.
- L=8.5 if else.
As they are measures these events do not have any duration. In the case of treatments on the other hand a duration period exists. For treatments, events definition is based on drugs prescriptions if exists or dispensing dates and a fixed period time if else. Functions below make event logs with the previous considerations. Treatment analysis of this document is thought to do by the predominant clinical condition of patients according to DM2 treatment algorithm. In other words, patients are grouped by their predominant clinical condition and the analysis is realized independently to each group.
Code
logging.basicConfig(level=logging.INFO)
def separate_patients_by_condition(con):
'''
Filtering data by predominant clinical condition according to redGDPS
Args:
con : db connector variable
Return:
presc_data_p (float): percentage of not null data in prescription dates
https://www.sciencedirect.com/science/article/pii/S1751991821000176?via%3Dihub#tbl0015
https://www.sanidad.gob.es/estadEstudios/estadisticas/estadisticas/estMinisterio/SIAP/map_cie9mc_cie10_ciap2.htm
obesity (BMI≥30 kg/m2), age older than 75, established CVD (defined as
myocardial infarction, ischemic heart disease, cerebrovascular disease,
or peripheral arterial disease), CKD (defined as eGFR < 60 ml/min/
1.73 m2 and/or UACR ≥ 30 mg/g) and HF.
'''
cvd_code_list = ['K76','K75','K76','K91','K92']
hf = 'K77'
ckd = 'U99.01'
ob = "T82"
to_list = lambda l: "('"+"','".join(l)+"')"
presc_data_p = con.sql(f" SELECT (COUNT(prescription_date_ini)) / COUNT(*) \
AS p FROM treat_incidents_view").fetchall()[0][0]
cie9_df = pd.read_csv('./CIE9.csv').rename(columns={'CIE9':'CIE',
'BDCAP':'CIAP'})
cie9_df['comorb_codif'] = 'ICD9'
cie10_df = pd.read_csv('./CIE10.csv').rename(columns={'CIE10':'CIE',
'BDCAP':'CIAP'})
cie10_df['comorb_codif'] = 'ICD10'
cie2ciap = pd.concat([cie9_df[['comorb_codif','CIE','CIAP']],
cie10_df[['comorb_codif','CIE','CIAP']]])
con.sql("DROP VIEW IF EXISTS comorb_incidents_ciap_view;")
con.sql("DROP VIEW IF EXISTS comorb_incidents_active_view;")
con.sql("CREATE VIEW comorb_incidents_ciap_view AS \
SELECT * FROM comorb_incidents_view \
LEFT JOIN cie2ciap \
ON comorb_incidents_view.comorb_codif = cie2ciap.comorb_codif \
AND comorb_incidents_view.comorb_code = cie2ciap.CIAP")
con.sql("CREATE VIEW comorb_incidents_active_view AS SELECT *, \
CASE WHEN comorb_codif = 'CIAP2' THEN comorb_code ELSE CIAP END AS CIAP2 \
FROM comorb_incidents_ciap_view WHERE comorb_date_end IS NULL;")
# Which is the predominal clinical condition at baseline?
f_list = set()
ob_list = set(con.sql(f"SELECT DISTINCT patient_id \
FROM comorb_incidents_active_view WHERE CIAP2 = '{ob}'"
).df()['patient_id'])
ckd_list = set(con.sql(f"SELECT DISTINCT patient_id \
FROM comorb_incidents_active_view WHERE CIAP2 = '{ckd}'"
).df()['patient_id'])
hf_list = set(con.sql(f"SELECT DISTINCT patient_id \
FROM comorb_incidents_active_view WHERE CIAP2 = '{hf}'"
).df()['patient_id'])
cvd_list = set(con.sql(f"SELECT DISTINCT patient_id \
FROM comorb_incidents_active_view \
WHERE CIAP2 IN {to_list(cvd_code_list)}"
).df()['patient_id'])
# Which is the predominal clinical condition at dx_date?
# ENFERMEDADES ISQUÉMICAS CARDIACAS (I20-I25)
# ENFERMEDADES CEREBROVASCULARES (I60-I69)
# ATEROESCLEROSIS (ENFERMEDADES VASCULARES PERIFÉRICAS) (I70)
# OTRAS ENFERMEDADES VASCULARES PERIFÉRICAS (I73)
# EMBOLIA Y TROMBOSIS ARTERIAL (ENFERMEDADES VASCULARES PERIFÉRICAS) (I74)
to_cvd_change_prelist = ['I2%i' %i for i in range(6)]+[
'I6%i' %i for i in [1,3,4,5,6,7,8,9]]+[
'I7%i' %i for i in [0,3,4]]
to_cvd_change_list = []
for code in to_cvd_change_prelist:
for c in cie10_df['CIE']:
if code in c and c not in ['I51.4','I60','I62','I67.1','I68.2','I67.5']:
to_cvd_change_list.append(c)
# INSUFICIENCIA CARDÍACA (I50)
# ENFERMEDAD CARDÍACA Y RENAL CRÓNICA HIPERTENSIVA (I13)
to_hf_change_list = ['I09.81', 'I11.0', 'I13.0', 'I13.2', 'I50', 'I50.0',
'I50.1', 'I50.9']
# ENFERMEDAD RENAL CRÓNICA (N18)
# ENFERMEDAD RENAL CRÓNICA HIPERTENSIVA (I12)
to_ckd_change_list = ['N18','N18.1','N18.2','N18.3','N18.30','N18.31',
'N18.32','N18.4','N18.5','N18.6','N18.9','I12',
'I12.0','I12.9','filtglom']
to_f_change_list = ['f_date']
to_ob_change_list = ['bmi>=30','E66.01','E66.09','E66.1','E66.2',
'E66.8','E66.9']
# SOBREPESO Y OBESIDAD (E66)
from_cvd_change_list = ['death']
from_hf_change_list = from_cvd_change_list+to_cvd_change_list
from_ckd_change_list = from_hf_change_list+to_hf_change_list
from_f_change_list = from_ckd_change_list+to_ckd_change_list
from_ob_change_list = from_f_change_list+to_f_change_list+['bmi<30']
from_else_change_list = from_ob_change_list[:-1]+to_ob_change_list
relevance_order = pd.DataFrame()
relevance_order['code'] = from_ob_change_list+to_ob_change_list+['end']
relevance_order['order'] = range(len(relevance_order))
# https://ukkidney.org/health-professionals/information-resources/uk-eckd-guide/ckd-stages
# Include obesity and kcd detection by parameters values
param_filt = con.sql("SELECT * FROM param_incidents_view \
WHERE param_name = 'filtglom' AND param_value<60;"
).df().sort_values(['patient_id','param_date'])
param_filt['time_diff'] = param_filt.groupby(
'patient_id')['param_date'].diff()
cmbd_filt = param_filt[param_filt['time_diff'] < pd.Timedelta(days=90)
].drop_duplicates(subset='patient_id',keep='first')[
['patient_id','param_date','param_name']
].rename(columns={
'param_name':'code',
'param_date':'admission_date'})
con.sql("DROP VIEW IF EXISTS cmbd_incidents_long_view;")
con.sql("CREATE VIEW cmbd_incidents_long_view AS SELECT *,\
COALESCE(t.order, 404) AS relevance FROM\
(SELECT patient_id, admission_date, diagnosis1_code AS code \
FROM cmbd_incidents_view WHERE diagnosis1_code IS NOT NULL \
UNION ALL SELECT patient_id, admission_date, diagnosis2_code AS code \
FROM cmbd_incidents_view WHERE diagnosis2_code IS NOT NULL \
UNION ALL SELECT patient_id, admission_date, diagnosis3_code AS code \
FROM cmbd_incidents_view WHERE diagnosis3_code IS NOT NULL \
UNION ALL SELECT patient_id, admission_date, code FROM cmbd_filt \
UNION ALL SELECT patient_id, turn_to_75 AS admission_date, 'f_date' \
AS code FROM patient_incidents_view WHERE turn_to_75<'2023-01-01'\
UNION ALL SELECT patient_id, death_date AS admission_date, 'death' \
AS code FROM patient_incidents_view WHERE death_date IS NOT NULL\
UNION ALL SELECT patient_id, DATE '2023-01-01' AS admission_date, 'end' \
AS code FROM patient_incidents_view\
UNION ALL SELECT patient_id,param_date AS admission_date, \
CASE WHEN param_value < 30 THEN 'bmi<30' ELSE 'bmi>=30' END AS code \
FROM param_incidents_predx_last_view WHERE param_name = 'bmi' \
UNION ALL SELECT patient_id, param_date AS admission_date, \
CASE WHEN param_value < 30 THEN 'bmi<30' ELSE 'bmi>=30' END AS code \
FROM param_incidents_view sub WHERE param_name = 'bmi' \
AND param_date > (SELECT dx_date FROM patient_incidents_view \
WHERE patient_incidents_view.patient_id = sub.patient_id)) v \
LEFT JOIN relevance_order t ON v.code=t.code;")
f_list.update(set(con.sql("SELECT DISTINCT patient_id \
FROM patient_incidents_view WHERE turn_to_75<=dx_date"
).df()['patient_id']))
con.sql("DROP VIEW IF EXISTS ob_events;")
con.sql(f"CREATE VIEW ob_events AS SELECT * \
FROM cmbd_incidents_long_view sub \
WHERE code IN {to_list(to_ob_change_list)} \
AND admission_date <= (SELECT dx_date \
FROM patient_incidents_view \
WHERE patient_incidents_view.patient_id = sub.patient_id);"
)
ob_list.update(set(con.sql("SELECT DISTINCT patient_id \
FROM ob_events").df()['patient_id']))
ob_list = ob_list - set(con.sql(f"SELECT DISTINCT patient_id \
FROM cmbd_incidents_long_view sub \
WHERE code='bmi<30' \
AND admission_date <= (SELECT dx_date \
FROM patient_incidents_view \
WHERE patient_incidents_view.patient_id = sub.patient_id) \
AND admission_date > (SELECT MAX(admission_date)\
FROM ob_events \
WHERE ob_events.patient_id = sub.patient_id);"
).df()['patient_id'])
ckd_list.update(set(con.sql(f"SELECT DISTINCT patient_id \
FROM cmbd_incidents_long_view sub \
WHERE code IN {to_list(to_ckd_change_list)} \
AND admission_date <= (SELECT dx_date \
FROM patient_incidents_view \
WHERE patient_incidents_view.patient_id = sub.patient_id);"
).df()['patient_id']))
hf_list.update(set(con.sql(f"SELECT DISTINCT patient_id \
FROM cmbd_incidents_long_view sub \
WHERE code IN {to_list(to_hf_change_list)} \
AND admission_date <= (SELECT dx_date \
FROM patient_incidents_view \
WHERE patient_incidents_view.patient_id = sub.patient_id);"
).df()['patient_id']))
cvd_list.update(set(con.sql(f"SELECT DISTINCT patient_id \
FROM cmbd_incidents_long_view sub \
WHERE code IN {to_list(to_cvd_change_list)} \
AND admission_date <= (SELECT dx_date \
FROM patient_incidents_view \
WHERE patient_incidents_view.patient_id = sub.patient_id);"
).df()['patient_id']))
cvd_list = list(cvd_list)
hf_list = list(hf_list.difference(cvd_list))
ckd_list = list(ckd_list.difference(cvd_list+hf_list))
f_list = list(f_list.difference(cvd_list+hf_list+ckd_list))
ob_list = list(ob_list.difference(cvd_list+hf_list+ckd_list+f_list))
else_list = list(set(con.sql("SELECT DISTINCT patient_id \
FROM patient_incidents_view"
).df()['patient_id']
).difference(cvd_list+hf_list+ckd_list+f_list+ob_list))
# When is a change on predominant clinical condition made?
con.sql("DROP VIEW IF EXISTS cmbd_incidents_long_view_dx")
con.sql(f"CREATE VIEW cmbd_incidents_long_view_dx AS SELECT * \
FROM cmbd_incidents_long_view sub \
WHERE admission_date > (SELECT dx_date \
FROM patient_incidents_view \
WHERE patient_incidents_view.patient_id = sub.patient_id);")
query_function = lambda cond: "(WITH rankedevents_*** AS (\
SELECT patient_id,code,admission_date,ROW_NUMBER() \
OVER(PARTITION BY patient_id ORDER BY admission_date,relevance) AS event_rank \
FROM cmbd_incidents_long_view_dx WHERE patient_id IN {to_list(***_list)} \
AND code IN {to_list(from_***_change_list)} ) \
SELECT patient_id, '***' AS type,code, admission_date AS date \
FROM rankedevents_*** WHERE event_rank=1 AND date<'2023-01-01') \
UNION ALL (WITH rankedevents__*** AS (\
SELECT patient_id,code,admission_date,ROW_NUMBER() \
OVER(PARTITION BY patient_id ORDER BY admission_date) AS event_rank \
FROM cmbd_incidents_long_view_dx WHERE patient_id IN {to_list(***_list)} \
AND code IN {to_list(['end']+from_***_change_list)} ) \
SELECT patient_id, '***' AS type,code, admission_date AS date \
FROM rankedevents__*** WHERE event_rank=1 AND code='end')\
".replace('***',cond)
con.sql("DROP TABLE IF EXISTS patient_condition;")
con.sql(eval('f"CREATE TABLE patient_condition AS {}"'.format(' UNION ALL '.join(
[query_function(cond) for cond in ['cvd','hf','ckd','f','ob','else']]))))
return presc_data_p
def evlog_creation_by_prescriptions(con,
cond,
code2drug_info_path='./diabetes_drugs.csv'):
'''Preprocessing and event log obtention with prescriptions
Args:
con : db connector variable
cond (str): predominal clinical condition's code
code2drug_info_path (str): drugs' and their codes' info's table's path
ADNI: ANTIDIABETICOS NO INSULINICOS
The treatment of type 2 diabetes mellitus with ADNI includes a wide range of
drugs which, depending on their drugs which, according to their mechanisms of
action, can be grouped as follows:
Increased endogenous insulin sensitivity:
o Biguanides: metformin (MET).
o Thiazolidinediones: pioglitazone (PIO).
Increased endogenous insulin secretion/release:
o Sulfonylureas (SU).
o Meglitinides: repaglinide (REP)
Reduction of digestive glucose absorption:
o Alpha-glucosidase inhibitors.
o Vegetable fiber and derivatives.
Incretin effect enhancers.
o Inhibitors of DPP-4 (iDPP-4).
o GLP-1 analogues (aGLP-1).
Inhibitors of renal glucose reuptake
o SGLT-2 inhibitors (iSGLP-2)
'''
con.sql(f"DROP TABLE IF EXISTS evlog_raw_{cond}")
fin_date = con.sql(f"SELECT patient_id,date FROM patient_condition \
WHERE type = '{cond}'").df()
dx_date = con.sql(f"SELECT patient_id, dx_date FROM patient_incidents_view \
WHERE patient_id IN (SELECT patient_id \
FROM patient_condition \
WHERE patient_condition.type = '{cond}')").df()
dx_date = dict(zip(dx_date.patient_id,dx_date.dx_date))
fin_date = dict(zip(fin_date.patient_id,fin_date.date))
code2drug = pd.read_csv(code2drug_info_path,index_col=0).to_dict()
code2drug_f = lambda atc: code2drug.get(atc,{'class':'NONE'}
).get('class','NONE2'
).replace('+','&')
treat_df = con.sql(f"SELECT * FROM treat_incidents_view \
WHERE patient_id IN (SELECT patient_id \
FROM patient_condition \
WHERE patient_condition.type = '{cond}') \
AND substring(atc_code,1,3) = 'A10'").df()
treat_df['Event'] = treat_df['atc_code'].apply(code2drug_f)
treat_df = treat_df[~treat_df['Event'].isin(['NONE','NONE2'])
].drop_duplicates(subset=['patient_id',
'prescription_date_ini',
'prescription_date_end',
'Event'])
treat_df['actins'] = range(len(treat_df))
treat_df_start = treat_df[['patient_id','prescription_date_ini',
'Event','actins']].rename(columns =
{'prescription_date_ini':'date'})
treat_df_end = treat_df[['patient_id','prescription_date_end',
'Event','actins']].rename(columns =
{'prescription_date_end':'date'})
treat_df_start['cycle'] = 'start'
treat_df_end['cycle'] = 'end'
treat_df_end['date'] = treat_df_end['date'].fillna(
datetime.strptime('2022-12-31', "%Y-%m-%d")).apply(
lambda x: min([x,datetime.strptime('2023-01-01', "%Y-%m-%d")])).apply(
lambda x: x-timedelta(days=1))
act_n = max(treat_df['actins'])
if cond in ['cvd','hf','ckd','f']:
param_df = con.sql(f"SELECT patient_id,param_date AS date, \
CASE WHEN param_value < {8} THEN 'HBA<L' \
WHEN param_value >= {8} AND param_value < {8.5} \
AND param_date < turn_to_65 THEN 'HBA>L' \
WHEN param_value >= {8} AND param_value < {8.5} \
AND param_date >= turn_to_65 THEN 'HBA<L' \
ELSE 'HBA>L' END AS Event,\
ROW_NUMBER() OVER (ORDER BY param_date) + {act_n} AS actins\
FROM param_incidents_view_age_date \
WHERE patient_id IN (SELECT patient_id \
FROM patient_condition \
WHERE patient_condition.type = '{cond}') \
AND param_name = 'hba1c' \
AND param_value IS NOT NULL").df()
else:
param_df = con.sql(f"SELECT patient_id,param_date AS date, \
CASE WHEN param_value < {7} THEN 'HBA<L' \
WHEN param_value >= {7} AND param_value < {8.5} \
AND param_date < turn_to_75 THEN 'HBA>L' \
WHEN param_value >= {7} AND param_value < {8.5} \
AND param_date >= turn_to_75 THEN 'HBA<L' \
ELSE 'HBA>L' END AS Event,\
ROW_NUMBER() OVER (ORDER BY param_date) + {act_n} AS actins\
FROM param_incidents_view_age_date \
WHERE patient_id IN (SELECT patient_id \
FROM patient_condition \
WHERE patient_condition.type = '{cond}') \
AND param_name = 'hba1c' \
AND param_value IS NOT NULL").df()
param_df_start = param_df.copy()
param_df_start['cycle'] = 'start'
param_df_end = param_df.copy()
param_df_end['cycle'] = 'end'
df = pd.concat([param_df_start,treat_df_start,param_df_end,treat_df_end]
).drop_duplicates()
df = df.sort_values(by = ['patient_id','date','Event','cycle'],
ascending = [True,True,True,False])
df.index = range(len(df))
patient_list = list(set(df['patient_id']))
df['nid'] = [patient_list.index(df['patient_id'][n]) for n in range(len(df))]
#####################################################################
event_log = dict()
event_log['patient_id'] = []
event_log['date'] = []
event_log['nid'] = []
event_log['Event'] = []
event_log['cycle'] = []
days_after_m = 5
id_list = list(set(df['patient_id']))
events = list(set([drug for e in set(df['Event']) for drug in e.split('&')]))
hba0, hba1 = events.index('HBA<L'), events.index('HBA>L')
row = len(events)
no_hba = [i for i in range(row) if i not in [hba0,hba1]]
for id in tqdm(id_list):
df_id = df[df['patient_id']==id]
nid = id_list.index(id)
actins = set(df_id['actins'])
date_min = min([min(set(df_id['date'])),dx_date[id]])
date_max = max(set(df_id['date']))
if str(fin_date[id])!='nan':
date_max = max([date_max,fin_date[id]])
else:
date_max = max([date_max,datetime.strptime('2023-01-01', "%Y-%m-%d")])
dd = [date_min + timedelta(days=x) for x in range(
(date_max-date_min).days + 1)]
col = len(dd)
ev_status = np.zeros((row,col))
for act in actins:
ini = list(df_id['date'][np.logical_and(df_id['actins']==act,
df_id['cycle']=='start')])[0]
ev = list(df_id['Event'][df_id['actins']==act])[0]
fin = list(df_id['date'][np.logical_and(df_id['actins']==act,
df_id['cycle']=='end')])[0]
ev_cols = [events.index(ev_) for ev_ in ev.split('&')]
ev_status = change_matrix_values(ev_status,ev_cols,
list(range(dd.index(ini),
dd.index(fin)+1)),1)
measures = list(np.concatenate((ev_status[hba0,:].nonzero()[0],
ev_status[hba1,:].nonzero()[0])))
ev_status = ev_status.astype(int)
#delete treatments in measure events
ev_status = change_matrix_values(ev_status,no_hba,measures,0)
#correction to eliminate events after hemoglobin measurement that are
#the continuation of the previous treatment and not the new treatment.
#Example: If a patient has 'A' treatment and because of hemoglobin's
# measure they change their treatment to 'B', originally their
# trace could appear as A>measure>A>B, but the true trace
# should be A>measure>B. Therefore, in a fixed period time
# after each measurement 'days_after_m', if we detect that
# mistake we delete it.
for j in measures:
if ev_status.shape[1]<=j+1 or days_after_m<3:
continue
first_col = ev_status[:,j+1]
dif_col = np.where(np.any(ev_status[:, j+2:j+days_after_m]!=
first_col[:, np.newaxis],
axis=0))[0]
if (dif_col.size > 0) and (not
np.array_equal(ev_status[:,j-1],
ev_status[:,j+dif_col[0]+2])) and (
np.array_equal(ev_status[:,j-1],
ev_status[:,j+1])):
ev_status = change_matrix_values(ev_status,no_hba,
range(j+1,j+dif_col[0]+2),0)
col = dd.index(dx_date[id])
col_0 = dd.index(dx_date[id])
col_max = dd.index(fin_date[id]) if str(fin_date[id])!='nan' else len(dd)
min_change_days = 7
while col<col_max-1:
if not np.array_equal(ev_status[:,col],ev_status[:,col+1]):
ev = col2treat(ev_status,events,col_0)
if ('HBA' not in ev and col-col_0<min_change_days):
col+=1
col_0 = col
continue
event_log['patient_id'].extend([id,id])
event_log['date'].extend([dd[col_0],dd[col]])
event_log['Event'].extend([ev,ev])
event_log['nid'].extend([nid,nid])
event_log['cycle'].extend(['start','end'])
col+=1
col_0 = col
else:
col+=1
ev = col2treat(ev_status,events,col_0)
if 'HBA' in ev or col-col_0>=min_change_days:
event_log['patient_id'].extend([id,id])
event_log['date'].extend([dd[col_0],dd[col]])
event_log['Event'].extend([ev,ev])
event_log['nid'].extend([nid,nid])
event_log['cycle'].extend(['start','end'])
evlog = pd.DataFrame.from_dict(event_log)
evlog = evlog.sort_values(['patient_id','date'])
evlog.rename(columns = {'patient_id':'ID'}, inplace = True)
evlog.index = range(len(evlog))
evlog['cycle'] = ['start','end']*int(0.5*len(evlog))
evlog['actins'] = [n//2 for n in range(len(evlog))]
con.sql(f"CREATE TABLE evlog_raw_{cond} AS SELECT * FROM evlog")
def evlog_creation_by_dispensations(con,
cond,
code2drug_info_path='./diabetes_drugs.csv',
nac_path='./Nomenclator_de_Facturacion.csv'):
'''Preprocessing and event log obtention with dispensations
Args:
con : db connector variable
cond (str): predominal clinical condition's code
code2drug_info_path (str): drugs' and their codes' info's table's path
nac_path (str): drugs' and their containers' info's table's path
ADNI: ANTIDIABETICOS NO INSULINICOS
The treatment of type 2 diabetes mellitus with ADNI includes a wide range of
drugs which, depending on their drugs which, according to their mechanisms of
action, can be grouped as follows:
Increased endogenous insulin sensitivity:
o Biguanides: metformin (MET).
o Thiazolidinediones: pioglitazone (PIO).
Increased endogenous insulin secretion/release:
o Sulfonylureas (SU).
o Meglitinides: repaglinide (REP)
Reduction of digestive glucose absorption:
o Alpha-glucosidase inhibitors.
o Vegetable fiber and derivatives.
Incretin effect enhancers.
o Inhibitors of DPP-4 (iDPP-4).
o GLP-1 analogues (aGLP-1).
Inhibitors of renal glucose reuptake
o SGLT-2 inhibitors (iSGLP-2)
'''
con.sql(f"DROP TABLE IF EXISTS evlog_raw_{cond}")
treat_min_days = 2
days_error = 60-treat_min_days
days_after_m = 7
fin_date = con.sql(f"SELECT patient_id,date FROM patient_condition \
WHERE type = '{cond}'").df()
dx_date = con.sql(f"SELECT patient_id, dx_date FROM patient_incidents_view \
WHERE patient_id IN (SELECT patient_id \
FROM patient_condition \
WHERE patient_condition.type = '{cond}')").df()
dx_date = dict(zip(dx_date.patient_id,dx_date.dx_date))
fin_date = dict(zip(fin_date.patient_id,fin_date.date))
code2drug = pd.read_csv(code2drug_info_path,index_col=0).to_dict()
code2drug_f = lambda atc: code2drug.get(atc,{'class':'NONE'}
).get('class','NONE2'
).replace('+','&')
treat_df = con.sql(f"SELECT * FROM treat_incidents_view \
WHERE patient_id IN (SELECT patient_id \
FROM patient_condition \
WHERE patient_condition.type = '{cond}') \
AND substring(atc_code,1,3) = 'A10'").df()
treat_df['Event'] = treat_df['atc_code'].apply(code2drug_f)
treat_df = treat_df[~treat_df['Event'].isin(['NONE','NONE2'])]
nac_dict = nac2comprimidos(nac_path,set(treat_df['nac_code']))
treat_df['nac_code'] = treat_df['nac_code'].apply(
lambda x: nac_dict.get(x,days_error))
treat_df['containers_number'] = treat_df['containers_number'].fillna(1)
treat_df['nac_code'] = treat_df['nac_code']*treat_df['containers_number']
treat_df = treat_df[['patient_id','dispensing_date','Event','nac_code']
].rename(columns = {'dispensing_date':'date'}
).sort_values(by = ['patient_id','date' ],
ascending = [True, True])
if cond in ['cvd','hf','ckd','f']:
df = con.sql(f"SELECT *, 'start' AS cycle,\
ROW_NUMBER() OVER (ORDER BY patient_id, date) AS actins \
FROM (SELECT patient_id,param_date AS date, \
CASE WHEN param_value < {8} THEN 'HBA<L' \
WHEN param_value >= {8} AND param_value < {8.5} \
AND param_date < turn_to_65 THEN 'HBA>L' \
WHEN param_value >= {8} AND param_value < {8.5} \
AND param_date >= turn_to_65 THEN 'HBA<L' \
ELSE 'HBA>L' END AS Event, 0 AS nac_code\
FROM param_incidents_view_age_date \
WHERE patient_id IN (SELECT patient_id \
FROM patient_condition \
WHERE patient_condition.type = '{cond}') \
AND param_name = 'hba1c' \
AND param_value IS NOT NULL \
UNION ALL SELECT * FROM treat_df) \
UNION ALL SELECT patient_id, date, Event, nac_code, 'end' AS cycle,\
ROW_NUMBER() OVER (ORDER BY patient_id, param_date) AS actins \
FROM (SELECT patient_id,param_date AS date, param_date,\
CASE WHEN param_value < {8} THEN 'HBA<L' \
WHEN param_value >= {8} AND param_value < {8.5} \
AND param_date < turn_to_65 THEN 'HBA>L' \
WHEN param_value >= {8} AND param_value < {8.5} \
AND param_date >= turn_to_65 THEN 'HBA<L' \
ELSE 'HBA>L' END AS Event, 0 AS nac_code\
FROM param_incidents_view_age_date \
WHERE patient_id IN (SELECT patient_id \
FROM patient_condition \
WHERE patient_condition.type = '{cond}') \
AND param_name = 'hba1c' \
AND param_value IS NOT NULL \
UNION ALL SELECT patient_id,\
CAST(date AS TIMESTAMP)+INTERVAL {treat_min_days} \
DAYS AS date, date AS param_date, \
Event, nac_code FROM treat_df)").df()
else:
df = con.sql(f"SELECT *, 'start' AS cycle,\
ROW_NUMBER() OVER (ORDER BY patient_id, date) AS actins \
FROM (SELECT patient_id,param_date AS date, \
CASE WHEN param_value < {7} THEN 'HBA<L' \
WHEN param_value >= {7} AND param_value < {8.5} \
AND param_date < turn_to_75 THEN 'HBA>L' \
WHEN param_value >= {7} AND param_value < {8.5} \
AND param_date >= turn_to_75 THEN 'HBA<L' \
ELSE 'HBA>L' END AS Event, 0 AS nac_code\
FROM param_incidents_view_age_date \
WHERE patient_id IN (SELECT patient_id \
FROM patient_condition \
WHERE patient_condition.type = '{cond}') \
AND param_name = 'hba1c' \
AND param_value IS NOT NULL \
UNION ALL SELECT * FROM treat_df) \
UNION ALL SELECT patient_id, date, Event, nac_code, 'end' AS cycle,\
ROW_NUMBER() OVER (ORDER BY patient_id, param_date) AS actins \
FROM (SELECT patient_id,param_date AS date, param_date,\
CASE WHEN param_value < {7} THEN 'HBA<L' \
WHEN param_value >= {7} AND param_value < {8.5} \
AND param_date < turn_to_75 THEN 'HBA>L' \
WHEN param_value >= {7} AND param_value < {8.5} \
AND param_date >= turn_to_75 THEN 'HBA<L' \
ELSE 'HBA>L' END AS Event, 0 AS nac_code\
FROM param_incidents_view_age_date \
WHERE patient_id IN (SELECT patient_id \
FROM patient_condition \
WHERE patient_condition.type = '{cond}') \
AND param_name = 'hba1c' \
AND param_value IS NOT NULL \
UNION ALL SELECT patient_id,\
CAST(date AS TIMESTAMP)+INTERVAL {treat_min_days} \
DAYS AS date, date AS param_date, \
Event, nac_code FROM treat_df)").df()
patient_list = list(set(df['patient_id']))
df['nid'] = [patient_list.index(df['patient_id'][n]) for n in range(len(df))]
df = df.sort_values(by = ['patient_id', 'actins','cycle'],
ascending = [True, True, False])
df.index = range(len(df))
#####################################################################
event_log = dict()
event_log['patient_id'] = []
event_log['date'] = []
event_log['nid'] = []
event_log['Event'] = []
event_log['cycle'] = []
id_list = list(set(df['patient_id']))
events = list(set([drug for e in set(df['Event']) for drug in e.split('&')]))
hba0, hba1 = events.index('HBA<L'), events.index('HBA>L')
row = len(events)
no_hba = [i for i in range(row) if i not in [hba0,hba1]]
for id in tqdm(id_list):
df_id = df[df['patient_id']==id]
nid = id_list.index(id)
actins = set(df_id['actins'])
date_min = min([min(set(df_id['date'])),dx_date[id]])
date_max = max(set(df_id['date']))
if str(fin_date[id])!='nan':
date_max = max([date_max,fin_date[id]])
else:
date_max = max([date_max,datetime.strptime('2023-01-01', "%Y-%m-%d")])
dd = [date_min + timedelta(days=x) for x in range(
(date_max-date_min).days + 1)]
col = len(dd)
ev_status = np.zeros((row,col))
max_nac = int(max(df_id['nac_code'])*1.2)
#event matrix where columns are days and rows treatments
#0:no treatment, 1:treatment, 2:posible treatment,
#3:after measuring period
for act in actins:
ini = list(df_id['date'][np.logical_and(df_id['actins']==act,
df_id['cycle']=='start')])[0]
ev = list(df_id['Event'][df_id['actins']==act])[0]
fin = ini+timedelta(days=treat_min_days) if 'HBA' not in ev else ini
ev_cols = [events.index(ev_) for ev_ in ev.split('&')]
ev_status = change_matrix_values(ev_status,ev_cols,
list(range(dd.index(ini),
dd.index(fin)+1)),1)
if 'HBA' not in ev:
possible_day_duration = int(list(
df_id['nac_code'][np.logical_and(df_id['actins']==act,
df_id['cycle']=='start')])[0]*1.2)
until_day = min([dd.index(ini)+possible_day_duration,col])
ev_status = change_matrix_values(ev_status,ev_cols,
list(range(dd.index(fin),
until_day)),2)
measures = list(np.concatenate((ev_status[hba0,:].nonzero()[0],
ev_status[hba1,:].nonzero()[0])))
#event compaction.
#Example: If a patient has 'A' treatment and they are dispensing 'A'
# drug multiple times, their trace could appear as A>A>A>A,
# but the true trace should appear as 'A' treatment lasting
# its corresponding time. Therefore, if the difference between
# the last day and the first day of two consecutive same drugs's
# dispensation is less than the number of tablets of the
# container two events are jointed in one with the first day of
# the first event as initial date and last day of the last
# event as the end date.
ev_status = ev_status.astype(int)
for i in no_hba:
seq = str(list(ev_status[i,:]))
for delta_days in range(1,max_nac):
pattern = ', '+str([1]+[2]*delta_days+[1])[1:-1]
new_pattern = ', '+str([1]+[1]*delta_days+[1])[1:-1]
seq = seq.replace(pattern,new_pattern)
ev_status[i,:] = eval(seq)
ev_status = change_matrix_values(ev_status,no_hba,measures,3)
for i,j in list(itertools.combinations(no_hba,2)):
seq_i = str(list(ev_status[i,:]))
seq_j = str(list(ev_status[j,:]))
for delta_days in range(1,max_nac):
pattern_i = ', '+str([1]+[1]*delta_days+[3])[1:-1]
if re.search(pattern_i,seq_i)==None:
continue
pattern_j = ', '+str([1]+[2]*delta_days+[3])[1:-1]
new_pattern_j = ', '+str([1]+[1]*delta_days+[3])[1:-1]
seq_j = seq_j.replace(pattern_j,new_pattern_j)
for delta_days in range(1,max_nac):
pattern_j = ', '+str([1]+[1]*delta_days+[3])[1:-1]
if re.search(pattern_j,seq_j)==None:
continue
pattern_i = ', '+str([1]+[2]*delta_days+[3])[1:-1]
new_pattern_i = ', '+str([1]+[1]*delta_days+[3])[1:-1]
seq_i = seq_i.replace(pattern_i,new_pattern_i)
ev_status[i,:] = eval(seq_i)
ev_status[j,:] = eval(seq_j)
measures_w_dp = [[i+d_p for d_p in range(days_after_m+1)
if i+d_p<col] for i in measures]
measures_w_dp = list(itertools.chain(*measures_w_dp))
ev_status = change_matrix_values(ev_status,no_hba,measures_w_dp,3)
for i in no_hba:
seq = str(list(ev_status[i,:]))
seq = seq.replace('2','0').replace('3','0')
ev_status[i,:] = eval(seq)
col = dd.index(dx_date[id])
col_0 = dd.index(dx_date[id])
col_max = dd.index(fin_date[id]) if str(fin_date[id])!='nan' else len(dd)
min_change_days = 42
while col<col_max-1:
if not np.array_equal(ev_status[:,col],ev_status[:,col+1]):
ev = col2treat(ev_status,events,col_0)
if (ev=='_' and col-col_0<min_change_days):
col+=1
col_0 = col
continue
event_log['patient_id'].extend([id,id])
event_log['date'].extend([dd[col_0],dd[col]])
event_log['Event'].extend([ev,ev])
event_log['nid'].extend([nid,nid])
event_log['cycle'].extend(['start','end'])
col+=1
col_0 = col
else:
col+=1
ev = col2treat(ev_status,events,col_0)
if ev!='_' or col-col_0>=min_change_days:
event_log['patient_id'].extend([id,id])
event_log['date'].extend([dd[col_0],dd[col]])
event_log['Event'].extend([ev,ev])
event_log['nid'].extend([nid,nid])
event_log['cycle'].extend(['start','end'])
evlog = pd.DataFrame.from_dict(event_log)
evlog = evlog.sort_values(['patient_id','date'])
evlog.rename(columns = {'patient_id':'ID'}, inplace = True)
evlog.index = range(len(evlog))
evlog['cycle'] = ['start','end']*int(0.5*len(evlog))
evlog['actins'] = [n//2 for n in range(len(evlog))]
con.sql(f"CREATE TABLE evlog_raw_{cond} AS SELECT * FROM evlog")
def change_matrix_values(matrix,list_rows,list_cols, new_value=0):
'''change selected rows' and columns' matrix`s values
Args:
matrix (array): matrix wanted to modify
list_rows (list): matrix's rows wanted to modify
list_cols (list): matrix's columns wanted to modify
new_value (int): new wanted value
Output:
matrix (array): modified matrix
'''
pos = list(itertools.product(list_rows,list_cols))
for i,j in pos:
matrix[i,j] = new_value
return matrix
def col2treat(matrix,rows,col):
'''translate treatments from a binary vector
Args:
matrix (array): events calendary in binary information
rows (str): events list
col (int): column of matrix wanted to translate
Output:
treatment (str):
'''
treatment = '+'.join(sorted([rows[ev
] for ev in range(len(rows))
if matrix[ev,col]!=0]))
if treatment!='':
return treatment
else:
return '_'
def nac2comprimidos(nac_path,code_set):
'''Creating dictionary of nac drugs container's number of tablets
Args:
nac_path (str): nac table's path
code_set (set): nac code list
Outputs:
nac_dict (dic): nac codes as keys and their number of tablets as values
'''
nac = pd.read_csv(nac_path)
nac['Código Nacional'] = nac['Código Nacional'].astype(str)
nac_dict = {}
for n in range(len(nac)):
if nac['Código Nacional'][n] in code_set:
text = nac['Nombre del producto farmacéutico'][n].lower()
if re.search(r'\d+(?=\s*comprim)', text)!=None:
nac_dict[nac['Código Nacional'][n]
] = int(re.search(r'\d+(?=\s*comprim)', text).group())
elif re.search(r'\d+(?=\s*comprim)',
re.sub(r'\([^()]*\)', '', text))!=None:
nac_dict[nac['Código Nacional'][n]
] = int(re.search(r'\d+(?=\s*comprim)',
re.sub(r'\([^()]*\)', '', text)).group())
elif re.search(r'con pelicula (\d+)', text)!=None:
nac_dict[nac['Código Nacional'][n]
] = int(re.search(r'con pelicula (\d+)',
text).group().split()[-1])
return nac_dictProcess indicators’ event log creation
With the objective of studying process indicators a function to make a process indicators’ event log is coded above. There, in addition to laboratory measures and realized exams, three artificial events have been included to aid the analysis.’INI’ event denotes each patient’s diabetes diagnosis date, ‘yFIN’ events indicate the end of each annual interval and ‘FIN’ event is the date of the end of the last completed annual interval.
Code
def evlog_process_ind(con):
'''Generation of process indicators' eventlog
Args:
con : db connector variable
'''
con.sql("DROP VIEW IF EXISTS pre_process_ind;")
con.sql("DROP TABLE IF EXISTS process_ind;")
con.sql("DROP TABLE IF EXISTS process_ind1;")
con.sql("DROP TABLE IF EXISTS process_ind2;")
con.sql("DROP TABLE IF EXISTS process_ind3;")
con.sql("DROP TABLE IF EXISTS process_ind4;")
con.sql("DROP TABLE IF EXISTS process_ind5;")
con.sql("CREATE VIEW pre_process_ind AS SELECT *,'start' AS cycle,\
ROW_NUMBER() OVER (ORDER BY patient_id,date) AS actins \
FROM (SELECT DISTINCT patient_id,'hba1c' AS event, \
CAST(param_date AS TIMESTAMP) + INTERVAL 60 SECOND AS date \
FROM param_incidents_view WHERE param_name = 'hba1c' \
UNION ALL SELECT DISTINCT patient_id,'col' AS event, \
CAST(param_date AS TIMESTAMP) + INTERVAL 120 SECOND AS date \
FROM param_incidents_view WHERE param_name IN ('col','hdl','ldl') \
UNION ALL SELECT DISTINCT patient_id,'blood_p' AS event, \
CAST(param_date AS TIMESTAMP) + INTERVAL 180 SECOND AS date \
FROM param_incidents_view WHERE param_name IN ('sbp','dbp') \
UNION ALL SELECT DISTINCT patient_id,'indalbcr' AS event, \
CAST(param_date AS TIMESTAMP) + INTERVAL 240 SECOND AS date \
FROM param_incidents_view WHERE param_name='indalbcr' \
UNION ALL SELECT DISTINCT patient_id,'filtglom' AS event, \
CAST(param_date AS TIMESTAMP) + INTERVAL 300 SECOND AS date \
FROM param_incidents_view WHERE param_name='filtglom' \
UNION ALL SELECT DISTINCT patient_id,'bmi' AS event, \
CAST(param_date AS TIMESTAMP) + INTERVAL 360 SECOND AS date \
FROM param_incidents_view WHERE param_name IN ('bmi','weight') \
UNION ALL SELECT DISTINCT patient_id,'ocular_exam' AS event, \
CAST(test_date AS TIMESTAMP) + INTERVAL 420 SECOND AS date \
FROM exams_incidents_view WHERE test_name='ocular_exam_PC' \
UNION ALL SELECT DISTINCT patient_id,'foot_exam' AS event, \
CAST(test_date AS TIMESTAMP) + INTERVAL 480 SECOND AS date \
FROM exams_incidents_view WHERE test_name='foot_exam' \
UNION ALL SELECT patient_id,'INI' AS event, \
CAST(dx_date AS TIMESTAMP) AS date \
FROM patient_incidents_view \
UNION ALL SELECT patient_id, 'yFIN' AS event,\
CAST(dx_date AS TIMESTAMP) + INTERVAL 365 DAYS AS date \
FROM patient_incidents_view \
UNION ALL SELECT patient_id, 'yFIN' AS event,\
CAST(dx_date AS TIMESTAMP) + INTERVAL 730 DAYS AS date \
FROM patient_incidents_view \
UNION ALL SELECT patient_id, 'yFIN' AS event,\
CAST(dx_date AS TIMESTAMP) + INTERVAL 1095 DAYS AS date \
FROM patient_incidents_view \
UNION ALL SELECT patient_id, 'yFIN' AS event,\
CAST(dx_date AS TIMESTAMP) + INTERVAL 1461 DAYS AS date \
FROM patient_incidents_view \
UNION ALL SELECT patient_id, 'yFIN' AS event,\
CAST(dx_date AS TIMESTAMP) + INTERVAL 1826 DAYS AS date \
FROM patient_incidents_view \
UNION ALL SELECT patient_id, 'FIN' AS event,\
CAST(COALESCE(deregistration_date,'2023-01-01') AS TIMESTAMP) AS date \
FROM patient_incidents_view)")
con.sql("CREATE TABLE process_ind AS SELECT a.* FROM pre_process_ind a \
JOIN (SELECT patient_id, MIN(date) as ini_date,MAX(date) as fin_date \
FROM pre_process_ind d WHERE event IN ('INI', 'yFIN') AND \
date<(SELECT c.date FROM pre_process_ind c \
WHERE c.event = 'FIN' AND c.patient_id = d.patient_id) \
GROUP BY patient_id) b ON a.patient_id = b.patient_id \
WHERE (a.date BETWEEN b.ini_date AND b.fin_date) OR a.event = 'FIN'")
con.sql("CREATE TABLE process_ind1 AS SELECT * FROM \
(WITH RankedActions AS (SELECT patient_id, date, event,\
ROW_NUMBER() OVER (PARTITION BY patient_id ORDER BY date) AS rn \
FROM process_ind WHERE event = 'yFIN') \
SELECT DISTINCT A.* FROM process_ind A \
JOIN RankedActions R ON A.patient_id = R.patient_id WHERE \
R.rn = 1 AND (A.date <= R.date OR A.event = 'FIN'))")
con.sql("CREATE TABLE process_ind2 AS SELECT * FROM \
(WITH RankedActions AS (SELECT patient_id, date, event,\
ROW_NUMBER() OVER (PARTITION BY patient_id ORDER BY date) AS rn \
FROM process_ind WHERE event = 'yFIN') \
SELECT DISTINCT A.* FROM process_ind A \
JOIN RankedActions R ON A.patient_id = R.patient_id WHERE \
R.rn = 2 AND (A.date <= R.date OR A.event = 'FIN'))")
con.sql("CREATE TABLE process_ind3 AS SELECT * FROM \
(WITH RankedActions AS (SELECT patient_id, date, event,\
ROW_NUMBER() OVER (PARTITION BY patient_id ORDER BY date) AS rn \
FROM process_ind WHERE event = 'yFIN') \
SELECT DISTINCT A.* FROM process_ind A \
JOIN RankedActions R ON A.patient_id = R.patient_id WHERE \
R.rn = 3 AND (A.date <= R.date OR A.event = 'FIN'))")
con.sql("CREATE TABLE process_ind4 AS SELECT * FROM \
(WITH RankedActions AS (SELECT patient_id, date, event,\
ROW_NUMBER() OVER (PARTITION BY patient_id ORDER BY date) AS rn \
FROM process_ind WHERE event = 'yFIN') \
SELECT DISTINCT A.* FROM process_ind A \
JOIN RankedActions R ON A.patient_id = R.patient_id WHERE \
R.rn = 4 AND (A.date <= R.date OR A.event = 'FIN'))")
con.sql("CREATE TABLE process_ind5 AS SELECT * FROM \
(WITH RankedActions AS (SELECT patient_id, date, event,\
ROW_NUMBER() OVER (PARTITION BY patient_id ORDER BY date) AS rn \
FROM process_ind WHERE event = 'yFIN') \
SELECT DISTINCT A.* FROM process_ind A \
JOIN RankedActions R ON A.patient_id = R.patient_id WHERE \
R.rn = 5 AND (A.date <= R.date OR A.event = 'FIN'))")
Distances
One of the most important aim of process mining is to show and explain the processes. However, the great variety of traces does not allow us to draw any clear conclusions and it is often necessary to simplify our data. Another option that we can do before simplifying, to avoid the excessive losing of information and give another perspective to the analysis, is to cluster the traces and to analyze them by cluster. To do that we have to measure somehow the differences between the traces; the distance between them. There are some distances that we can use to this task: edit distances, vector term similarity, LDA based distances, embedding based distances… Some of them are shown below as functions to calculate the distance matrix of traces:
Code
####### EDIT DISTANCE #######
def calculate_dm_ED(traces,measure_f):
'''Calculate distance matrix with some edit distance.
Args:
traces (list): patients' traces
measure_f: some edit distance function
Returns:
dm: distance matrix
'''
id2word = corpora.Dictionary(traces)
traces_e = [[id2word.token2id[t[n]] for n in range(len(t))] for t in traces]
traces_e_str = list(set([str(traces_e[i]
) for i in range(len(traces_e))]))
len_t_r = len(traces_e_str)
len_t = len(traces_e)
dm = np.zeros((len_t,len_t), dtype = np.float32)
same = measure_f(traces_e[0],traces_e[0])
d_dic = {str(t):dict() for t in traces_e_str}
for i in range(len_t_r):
d_dic[traces_e_str[i]][traces_e_str[i]] = same
for j in range(i+1,len_t_r):
d_ij = measure_f(eval(traces_e_str[i]),
eval(traces_e_str[j]))
d_dic[traces_e_str[i]][traces_e_str[j]] = d_ij
d_dic[traces_e_str[j]][traces_e_str[i]] = d_ij
for i in range(len_t):
dm[i][i] = same
for j in range(i+1,len_t):
t_i = str(traces_e[i])
t_j = str(traces_e[j])
d_ij = d_dic[t_i][t_j]
dm[i][j] = d_ij
dm[j][i] = d_ij
if same == 1:
dm = 1 - dm
return dm
####### TERM VECTOR SIMILARITY #######
def calculate_dm_TV(traces):
'''Calculate distance matrix with term vector similarity.
Args:
traces (list): list of traces
Returns:
dm (array): distance matrix
vectorizer: TfidfVectorizer
X: traces vectorized with TfidVectorizer
'''
corpus = [' '.join(t) for t in traces]
vectorizer = TfidfVectorizer(tokenizer=str.split)
X = vectorizer.fit_transform(corpus)
print('calculatin dm ...')
dm = np.asarray(np.matmul(X.todense(),X.todense().T))
dm = 1 - dm.round(decimals=4)
return dm, vectorizer, X
####### LDA BASED DISTANCE #######
def calculate_dm_LDA(traces,T=10):
'''Calculate distance matrix with LDA model.
Args:
traces (list): list of traces
T (int): number of topics of LDA model
Returns:
dm (array): distance matrix
lda_model: LdaModel
id2word (dict): tokenized events as keys and events by values
'''
# Create Dictionary
id2word = corpora.Dictionary(traces)
# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in traces]
# Make LDA model
lda_model = LdaModel(corpus=corpus,
id2word=id2word,
num_topics=T,
alpha = 1,
eta = 'auto',
random_state = 123)
get_c_topic = np.array(
lda_model.get_document_topics(corpus,minimum_probability = -0.1))
sigma = np.asarray([[get_c_topic[i][j][1]
for j in range(T)] for i in range(len(corpus))])
sigma2 = np.asarray(np.matmul(sigma,sigma.T))
len_t = len(traces)
dm = np.zeros((len_t,len_t), dtype = np.float32)
same = sigma2[0][0]/np.sqrt(sigma2[0][0]*sigma2[0][0])
for i in trange(len_t):
dm[i][i] = same
for j in range(i+1,len_t):
d_ij = sigma2[i][j]/np.sqrt(sigma2[i][i]*sigma2[j][j])
dm[i][j] = d_ij
dm[j][i] = d_ij
dm = 1-dm
return dm, lda_model, id2wordClustering
Once the distance matrix is obtained, we can proceed with the clustering. Each clustering method has its characteristics and its peculiarities. For example, we have to consider we have a distance matrix and not a data frame in which we apply directly the method. A hierarchical clustering algorithm seems to be a good choice in our case because in addition to the above it allows to choose the optimal number of clusters.
The next code box contains two different functions to choose the optimal number of clusters:
Code
def dendogram(dm,output_png='../../outputs/dendogram.png'):
'''Plot and save dendogram.
Args:
dm (array): distance matrix
output_png (str): saved dendogram's path
'''
dm_condensed = squareform(dm)
matrix = linkage(
dm_condensed,
method='complete'
)
sys.setrecursionlimit(10000000)
dn = dendrogram(matrix,truncate_mode='lastp',p=80)
sys.setrecursionlimit(1000)
plt.title('Dendrogram')
plt.ylabel('Distance')
plt.xlabel('Patients traces')
plt.savefig(output_png)
plt.clf()
def kelbow(dm,
elbow_metric='distortion',
locate_elbow=False,
output_path='../../outputs/'):
'''Plots to choose optimal clusters.
Args:
dm (array): distance matrix
elbow_metric (str): name of the method
locate_elbow (boolean): True if want to return optimal number of clusters
Returns:
k_opt (int)(optional): optimal number of clusters according to method
'''
model = AgglomerativeClustering(metric = "precomputed",
linkage = 'complete')
# k is range of number of clusters.
visualizer_ = KElbowVisualizer(model,
k=(2,25),
timings=False,
xlabel = 'cluster numbers',
metric=elbow_metric,
locate_elbow=locate_elbow)
# Fit data to visualizer
output_file = output_path+elbow_metric+'.png'
visualizer_.fit(dm)
# Finalize and render figure
#visualizer_.show(output_path=output_file,clear_figure=False)
visualizer_.show(output_path=output_file)
k_opt=None
if locate_elbow:
k_opt = visualizer_.elbow_value_
return k_optThe function ‘clust’ clusterizes traces in prefixed number of clusters:
Code
def clust(clust_n,dist_matrix,df_,id2trace,patients):
'''clusterize distance matrix.
Args:
clust_n (int): number of clusters obtained
dist_matrix (array): distance matrix
df_ (dataframe): event log
id2trace (dict): patient ids as keys and their traces as values
patients (list): patients' ids in same order as in dm
Returns:
df_ (dataframe): dataframe with patients and their clusters
'''
traces = list(id2trace[id] for id in sorted(id2trace.keys()))
model = AgglomerativeClustering(n_clusters=clust_n,
metric = "precomputed",
linkage = 'complete')
model.fit(dist_matrix)
labels = model.labels_
cluster_list ={id: labels[traces.index(id2trace[id])
] for id in patients}
df_['cluster'] = [cluster_list[df_['ID'][i]] for i in range(len(df_))]
return df_Descriptive analysis of treatments’ eventlog
The implementation below is made to show the most frequent traces in each cluster:
Code
def make_data_dict(log,top_k,col_id):
'''Obtain most frequent traces and their statistics
Args:
log (dataframe): event log
top_k (int): number of traces want to show
col_id (str): patients id column's name in df_
Returns:
data_dict (dict): traces as keys and ther statistics as values
'''
len_id = len(set(log[col_id]))
log_freq = pm4py.stats.get_variants(log)
freq_list = [(t,log_freq[t],len(t)) for t in set(log_freq.keys())]
trace = [list(t[0]) for t in sorted(freq_list, key=lambda x:
(len_id-x[1],x[2]))[:top_k]]
cases = [t[1] for t in sorted(freq_list, key=lambda x:
(len_id-x[1],x[2]))[:top_k]]
top_k = min(top_k,len(cases))
percentage = [100*cases[c]/len_id for c in range(top_k)]
cum_percentage = [sum(percentage[:p+1]) for p in range(top_k)]
data_dict = {"Trace": trace,
"Percentage": percentage,
"Cases": cases,
"Cumulative Percentage": cum_percentage}
return data_dict
def update_color_dict(color_dict,data_dict):
'''update of the color dict to include new events
Args:
color_dict (dict): events as keys and colors as values
data_dict (dict): traces as keys and ther statistics as values
Returns:
color_dict (dict): events as keys and colors as values
'''
cmap = plt.cm.get_cmap('tab20')
for event in set(itertools.chain.from_iterable(data_dict['Trace'])):
if event not in color_dict and len(color_dict)==20:
cmap = plt.cm.get_cmap('tab20b')
if event not in color_dict:
try:
color_dict.update({event:cmap(len(color_dict))})
except:
color_dict.update({event:cmap(2*(len(color_dict)-20))})
return color_dict
def trace_plotter(data_dict,
color_dict,
acronym,
output_file,
font_size=10,
percentage_box_width=0.8,
size=(15,9)):
'''configuration of the trace_explorer plot
Args:
color_dict (dict): events as keys and colors as values
data_dict (dict): traces as keys and their statistics as values
acronym (dict): events as keys and their acronyms as values
output_file (str): figure's path
font_size (int): font size
percentage_box_width (float): event boxes' width
size (tuple): figure's size
'''
fig, ax = plt.subplots(figsize=size)
percentage_position = max(len(t) for t in data_dict["Trace"]
) + percentage_box_width*3 +0.5
for row, (trace, percentage,cases,cum_percentage
) in enumerate(zip(data_dict["Trace"],
data_dict["Percentage"],
data_dict["Cases"],
data_dict["Cumulative Percentage"]),
start=1):
for col, acr in enumerate(trace, start=1):
ax.add_patch(plt.Rectangle((col - 0.5, row - 0.45), 1, 0.9,
facecolor=color_dict[acr],
edgecolor='white'))
ax.text(col,
row,
acr,
ha='center',
va='center',
color='white',
fontsize = font_size,
fontweight='bold')
ax.add_patch(plt.Rectangle((
percentage_position -percentage_box_width*2.5,
row - 0.45), percentage_box_width, 0.9,
facecolor='grey', edgecolor='white'))
ax.text(percentage_position-percentage_box_width*2,
row,
f'{percentage:.2f}%',
ha='center',
va='center',
color='white',
fontsize = font_size+2)
ax.add_patch(plt.Rectangle((
percentage_position - percentage_box_width*1.5,
row - 0.45), percentage_box_width, 0.9,
facecolor='grey', edgecolor='white'))
ax.text(percentage_position-percentage_box_width,
row,
f'{cases}',
ha='center',
va='center',
color='white',
fontsize = font_size+4)
ax.add_patch(plt.Rectangle((percentage_position-percentage_box_width*0.5,
row - 0.45), percentage_box_width, 0.9,
facecolor='grey', edgecolor='white'))
ax.text(percentage_position,
row,
f'{cum_percentage:.2f}%',
ha='center',
va='center',
color='white',
fontsize = font_size+2)
ax.set_xlim(0.5, percentage_position+0.5)
ax.set_xticks(range(1, int(percentage_position-1)))
ax.set_ylabel('Traces',fontsize = font_size+3)
ax.set_ylim(len(data_dict["Trace"]) + 0.45, 0.55) # y-axis is reversed
ax.set_yticks([])
ax.set_xlabel('Activities',fontsize = font_size+3)
handles = [plt.Rectangle((0, 0), 0, 0, facecolor=color_dict[acr],
edgecolor='black', label=acronym[acr])
for acr in acronym if acr in set(
itertools.chain.from_iterable(data_dict['Trace']))]
ax.legend(handles=handles,
bbox_to_anchor=[1.02, 1.02],
loc='upper left',
fontsize = font_size+6)
for dir in ['left', 'right', 'top']:
ax.spines[dir].set_visible(False)
plt.tight_layout()
plt.savefig(output_file)
plt.close()
def trace_explorer(con,
cond,
top_k=5,
id_col='ID',
ev_col='Event',
date_col='date',
clust_col='cluster',
color_dict={}):
'''Plot each clusters most frequent traces
Args:
con : db connector variable
cond (str): predominal clinical condition's code
top_k (int): traces as keys and their statistics as values
id_col (str): patients id column's name in evlog_file
ev_col (str): events column's name in evlog_file
date_col (str): events dates column's name in evlog_file
clust_col (str): cluster column's name in evlog_file
color_dict (dict): events as keys and colors as values
'''
log_ = pm4py.format_dataframe(con.sql(f"SELECT * FROM eventlog_{cond}_clust_filtered").df(),
case_id=id_col,
activity_key=ev_col,
timestamp_key=date_col)
log_ = log_.sort_values([id_col,date_col])
log_ = log_[log_['cycle']=='start']
for clust in set(log_[clust_col]):
log = log_[log_[clust_col]==clust]
len_id = len(set(log[id_col]))
acronym = {t:t for t in sorted(set(log[ev_col]))}
data_dict = make_data_dict(log,top_k,id_col)
color_dict = update_color_dict(color_dict, data_dict)
trace_plotter(data_dict,color_dict,acronym,
'../../outputs/t_cluster_%s_%i.png' % (cond,clust))
return color_dictTo get the process maps of each cluster next R functions can be used:
Code
load_log <- function(con,
query,case_id="ID",
activity_id="Event",
lifecycle_id="cycle",
activity_instance_id="actins",
timestamp="date"){
eventlog <- dbGetQuery(con,query)
eventlog = eventlog[order(eventlog$ID),]
#To transform date to a format we can work with
eventlog$date = as.POSIXct(eventlog$date, tz = "", format="%Y-%m-%d" ,
tryFormats = c("%Y/%m/%d",
"%Y/%m/%d"),
optional = FALSE)
evLog = eventlog %>%
mutate(resource = NA) %>%
mutate(cycle = fct_recode(cycle,
"start" = "start",
"complete" = "end")) %>%
eventlog(
case_id = case_id,
activity_id = activity_id,
lifecycle_id = lifecycle_id,
activity_instance_id = activity_instance_id,
timestamp = timestamp,
resource_id = 'resource'
)
return(evLog)
}
make_process_map <- function(log,t_freq,output_file){
log %>%
filter_activity_frequency(percentage = 1) %>% # show only most frequent
filter_trace_frequency(percentage = t_freq) %>%
process_map(type_nodes = performance(mean,units='days'),
type_edges = frequency('relative_case'),
sec_edges = performance(mean,units='days'),
render = T) %>%
export_svg %>%
charToRaw %>%
rsvg_png (output_file,width=2000)
}
process_map_by_cluster <- function(evLog,t_freq,cond_code){
for (clust in unique(evLog$cluster)) {
log <- evLog %>%
filter(cluster == clust)
make_process_map(log,t_freq,gsub("src/analysis-scripts/",
"",here("outputs",sprintf(
"pm_cluster_%s_%d.png",
cond_code, clust) )))
}
}Conformance checking
Conformance checking is a technique used to check process compliance by comparing event logs for a discovered process with the existing reference model (target model) of the same process. Basing on the DM2 treatment algorithm previous shown, with a software called Carassius , we created the next Petri Nets that are going to be useful as treatment guidelines in reference to glycated hemoglobin measures.
Fitness is the metric that measures how much a trace is distanced from a given process model, or from the guidelines in this case. There are different methods to calculate this metric but in the code below is used the aligned fitness. Since in this metric the initial marking and the final marking have to be fixed we included the events ‘INI’ and ‘FIN’ in the Petri Net and in each trace. Adding this artificial events allows us to compare all traces fitness in the same conditions.
Code
def id2treat_fitness(log ,
net,
initial_marking,
final_marking,
clust_col='cluster',
date_col='date',
ev_col='Event'):
'''Obtain traces fitness
Args:
log (dataframe): event log
net: petri net
initial_marking: initial place in the petri net
final_marking: final place in the petri net
clust_col (str): cluster column's name in log
date_col (str): events dates column's name in log
ev_col (str): events column's name in log
Returns:
df (dataframe): traces, their clusters and their fitnesses
'''
id2trace = {id:list(log['Event'][log['case:concept:name']==id]
) for id in set(log['case:concept:name'])}
id2ids = {id:[id2 for id2 in set(id2trace.keys()) if id2trace[id]==id2trace[id2]
] for id in set(log['case:concept:name'])}
for id in set(id2ids.keys()):
try:
for id2 in id2ids[id]:
if id2!=id:
del id2ids[id2]
except:
continue
id2fitness = dict()
for name in set(id2ids.keys()):
log2 = log.drop(log.index[log['case:concept:name'] !=name])
new = log2.copy().iloc[[0, -1]]
date_list = list(new[date_col])
index = list(new['@@index'])
actins = list(new['actins'])
clust = new[clust_col].values[0]
new[date_col] = new['time:timestamp'] = [date_list[0]-timedelta(days=1),
date_list[1]+timedelta(days=1)]
new[ev_col] = new['concept:name'] = ['INI','FIN']
new['@@index'] = [index[0]-1,index[1]+1]
new['actins'] = [actins[0]-1,actins[1]+1]
log2 = pd.concat([log2,new]).sort_values('time:timestamp')
aligned_fitness = pm4py.conformance_diagnostics_alignments(
log2, net, initial_marking, final_marking)[0]['fitness']
for id in id2ids[name]:
id2fitness[id] = {'ID':id,
'aligned_fitness':aligned_fitness,
clust_col: clust}
df = pd.DataFrame(id2fitness).T
df.index = range(len(df))
return dfThe function below takes a treatments’ event log and returns each patient’s period’s fitness.
Code
def id2treat_fitness_by_interval(con,
cond,
pn_file,
ini_place='place100',
fin_place='place111',
date_n_col='date',
fixed_period_time=90):
'''Obtain traces fitness by intervals of fixed_period_time days
Args:
con: db connector variable
cond (str): predominal clinical condition's code
pn_file (str): petri net's path
ini_place (str): initial place in the petri net
fin_place (str): final place in the petri net
date_n_col (str): events dates column's name in log
fixed_period_time (int): number of days in each interval
'''
net, initial_marking, final_marking = pm4py.read_pnml(pn_file)
initial_marking = Marking()
initial_marking[list(net.places)[[str(p) for p in net.places].index(
ini_place)]] = 1
final_marking = Marking()
final_marking[list(net.places)[[str(p) for p in net.places].index(
fin_place)]] = 1
event_log = pm4py.format_dataframe(con.sql(f"SELECT * FROM eventlog_{cond}_clust \
WHERE cycle = 'start'").df(),
case_id='ID',
activity_key='Event',
timestamp_key=date_n_col)
event_log = event_log[event_log['cycle']=='start']
baseline = datetime.strptime('2017-01-01', "%Y-%m-%d")
endline = datetime.strptime('2023-01-01', "%Y-%m-%d")
dd = [baseline + timedelta(days=x) for x in range((
endline-baseline).days + 1)]
dd_INI = dd[0]-timedelta(days=1)
dd_FIN = dd[-1]+timedelta(days=1)
event_log['time:timestamp'] = event_log['time:timestamp'].dt.tz_localize(None)
event_log[date_n_col] = event_log[date_n_col].dt.tz_localize(None)
event_log['date_'] = event_log[date_n_col].apply(
lambda x: dd.index(x))
event_log[date_n_col] = event_log[date_n_col].apply(
lambda x: dd.index(x))
event_log[date_n_col] = event_log.groupby('ID')[date_n_col].apply(
lambda x: x-x.min())
hosp = con.sql("SELECT * FROM cmbd_incidents_postdx_first_view").df()
hosp['admission_date'] = hosp['admission_date'].dt.tz_localize(None).apply(
lambda x: dd.index(x))
hosp = dict(zip(hosp.patient_id,hosp.admission_date))
period2fitness = pd.DataFrame()
id_list_df = []
p_start = []
p_end = []
fitness = []
date_0 = []
status = []
id_list = list(set(event_log['ID']))
for id in tqdm(id_list):
event_log_id = event_log[event_log['ID']==id]
hosp_d = hosp.get(id,100000)-min(event_log_id['date_'])
date_max = min(max(event_log_id[date_n_col]),hosp_d)
date_min = min(event_log_id['time:timestamp'])
n_periods = date_max//fixed_period_time
n_periods_r = date_max/fixed_period_time
if date_max<=fixed_period_time:
continue
for n in range(1,n_periods+1):
event_log_id_n = event_log_id[
event_log_id[date_n_col]<n*fixed_period_time]
new = event_log_id.copy().iloc[[0, -1]]
actins = list(new['actins'])
index = list(new['@@index'])
new['time:timestamp'] = [dd_INI,dd_FIN]
new['concept:name'] = ['INI','FIN']
new['@@index'] = [index[0]-1,index[1]+1]
new['actins'] = [actins[0]-1,actins[1]+1]
event_log_id_n = pd.concat([event_log_id_n,new]
).sort_values('time:timestamp')
aligned_fitness = pm4py.conformance_diagnostics_alignments(
event_log_id_n, net,
initial_marking, final_marking)[0]['fitness']
id_list_df.append(id)
p_start.append((n-1)*fixed_period_time)
p_end_value = n*fixed_period_time
if n==n_periods:
p_end_value = date_max-fixed_period_time
p_end.append(p_end_value)
fitness.append(aligned_fitness)
date_0.append(date_min)
if (n==n_periods or n==n_periods_r-1) and hosp_d==date_max :
status.append(1)
break
else:
status.append(0)
period2fitness['ID'] = id_list_df
period2fitness['t_0'] = p_start
period2fitness['t_1'] = p_end
period2fitness['fitness'] = fitness
period2fitness['ini_date'] = date_0
period2fitness['status'] = status
con.sql(f"DROP TABLE IF EXISTS period2fitness_{cond}")
con.sql(f"CREATE TABLE period2fitness_{cond} AS SELECT *,\
MAX(status) OVER (PARTITION BY ID) AS status2 FROM period2fitness \
WHERE t_0!=t_1") In the next function is shown a boxplot function to show clusters’ fitness distribution.
Code
def treatments_clusters_boxplot(con,
cond,
pn_file,
pn_png_file,
ini_place='place100',
fin_place='place111',
date_col='date',
ev_col='Event',
clust_col='cluster',
output_png='../../outputs/fitness_by_cluster.png'):
'''Barplot of the fitness of each cluster
Args:
con: db connector variable
cond (str): predominant clinical condition
pn_file: petri net's path
ini_marking: initial place in the petri net
fin_marking: final place in the petri net
date_col (str): events dates column's name in log
clust_col (str): cluster column's name in log
ev_col (str): events column's name in log
output_png (str): created figure's path
'''
log = pm4py.format_dataframe(con.sql(f"SELECT * FROM eventlog_{cond}_clust_filtered \
WHERE cycle = 'start'").df(),
case_id='ID',
activity_key=ev_col,
timestamp_key=date_col)
log = log.sort_values(date_col)
log.index = range(len(log))
net, initial_marking, final_marking = pm4py.read_pnml(pn_file)
initial_marking = Marking()
initial_marking[list(net.places)[[str(p) for p in net.places].index(
ini_place)]] = 1
final_marking = Marking()
final_marking[list(net.places)[[str(p) for p in net.places].index(
fin_place)]] = 1
pm4py.save_vis_petri_net(net, initial_marking, final_marking,pn_png_file)
df = id2treat_fitness(log,net,initial_marking,final_marking,
clust_col,date_col,ev_col)
df['aligned_fitness'] = pd.to_numeric(df['aligned_fitness'])
df['cluster'] = pd.to_numeric(df['cluster'])
df['ID'] = df['ID'].astype("string")
con.sql(f"DROP TABLE IF EXISTS eventlog_{cond}_byclust")
con.sql(f"CREATE TABLE eventlog_{cond}_byclust AS \
SELECT * FROM df")
data = [list(df['aligned_fitness'][df[clust_col]==i])
for i in sorted(set(df[clust_col]))]
fig = plt.figure(figsize =(10, 7))
# Creating axes instance
ax = fig.add_axes([0, 0, 1, 1])
# Creating plot
bp = ax.boxplot(data,labels=[i for i in sorted(set(df[clust_col]))])
plt.xlabel("Clusters")
plt.ylabel("Aligned Fitness")
# show plot
plt.savefig(output_png,bbox_inches='tight')
plt.close(fig)
Process indicators can also be represented in a Petri Net. Therefore we created the next Petri Nets to analyze whether the traces obey the guidelines the first five years:
The function below takes a process indicators’ event log and returns each patient’s fitness by year. The method to calculate the fitness in this case is the token base replay method.
Code
def id2process_fitness(con,
query,
pn_file,
pn_png_file,
ini_place='place37',
fin_place='place148',
id_col = 'patient_id',
event_col = 'event',
date_col='date',):
'''Calculate the fitness of a given process indicators' event log
Args:
con: db connector variable
cond (str): predominal clinical condition's code
query (str): query to select event log
pn_file (str): petri net's path
ini_place (str): initial place in the petri net
fin_place (str): final place in the petri net
id_col (str): ids' column's name in event log
event_col (str): events' column's name in event log
date_col (str): events' dates' column's name in event log
Returns:
df (dataframe): dataframe of ids' fitness
'''
net, initial_marking, final_marking = pm4py.read_pnml(pn_file)
initial_marking = Marking()
initial_marking[list(net.places)[[str(p) for p in net.places].index(
ini_place)]] = 1
final_marking = Marking()
final_marking[list(net.places)[[str(p) for p in net.places].index(
fin_place)]] = 1
pm4py.save_vis_petri_net(net, initial_marking, final_marking,pn_png_file)
event_log = pm4py.format_dataframe(con.sql(query).df(),
case_id=id_col,
activity_key=event_col,
timestamp_key=date_col)
date_list = []
fit_list = []
id_list = list(set(event_log[id_col]))
for id in tqdm(id_list):
log = event_log.drop(event_log.index[event_log['case:concept:name'] !=id]
).sort_values('time:timestamp')
fit_obj = pm4py.conformance.conformance_diagnostics_token_based_replay(
log, net, initial_marking, final_marking)
date_list.append(list(log['time:timestamp'])[-2])
fit_list.append(fit_obj[0]['trace_fitness'])
log_0 = log[log['concept:name'].isin(['INI','FIN','yFIN'])]
alfa = pm4py.conformance.conformance_diagnostics_token_based_replay(
log_0, net, initial_marking, final_marking)[0]['trace_fitness']
df = pd.DataFrame()
df['ID'] = id_list
df['fitness'] = [(x-alfa)/(1-alfa) for x in fit_list]
df['date'] = date_list
return dfDecision mining
Decision mining allows to analyze the event transitions in different part of processes. With another words, we can measure patients’ characteristics or their relevance in a specific place of the passed petri net. The next function makes a decision tree explaining how patients characteristics are taking into account. Moreover it shows a boxplot of the relevance of each feature in the decision tree obtained with mean decrease impurity which calculates each feature importance as the sum over the number of splits (across all tress) that include the feature, proportionally to the number of samples it splits.
Code
def decision_tree_pn_place(con,
cond,
features_list = ['age','sex','copayment'],
pn_file="./PN.pnml",
place2analyze='place9',
ini_place='place100',
fin_place='place111'):
'''Decision tree and features importances in selected place of PN
Args:
con : db connector variable
cond (str): predominant clinical condition's
features_list (list): features wanted to analyze
event_log_file (str): event log's path
pn_file (str): petri net's path
place2analyze (str): place wanted to analyze in petri net
ini_place (str): initial place in the petri net
fin_place (str): final place in the petri net
'''
log = pm4py.format_dataframe(con.sql(f"SELECT * FROM eventlog_{cond} WHERE \
ID in (SELECT ID FROM eventlog_{cond}_clust)").df(), case_id='ID',
activity_key='Event',timestamp_key='date')
log = log[log['cycle']=='start']
log = log[['concept:name','time:timestamp','case:concept:name']]
ini = min(list(log['time:timestamp']))-timedelta(days=1)
fin = max(list(log['time:timestamp']))+timedelta(days=1)
patients = list(set(log['case:concept:name']))
add_ini_fin = {'case:concept:name' : patients*2,
'time:timestamp':[ini]*len(patients)+[fin]*len(patients),
'concept:name':['INI']*len(patients)+['FIN']*len(patients)}
log = pd.concat([log,pd.DataFrame(add_ini_fin)])
log = log.sort_values(['case:concept:name','time:timestamp'])
log.index = range(len(log))
patients_df = con.sql("SELECT patient_id,age,sex,copayment FROM patient_incidents_view").df()
patients_df = patients_df.fillna(2)
patients_df['copayment'] = patients_df['copayment'].values.astype(int).astype(str)
log_patients = log.merge(patients_df,left_on='case:concept:name',
right_on='patient_id',how='left')
log_patients = log_patients[['concept:name', 'time:timestamp',
'case:concept:name', 'age',
'sex','copayment']]
net, initial_marking, final_marking = pm4py.read_pnml(pn_file)
initial_marking = Marking()
initial_marking[list(net.places)[[str(p)
for p in net.places].index(ini_place)]] = 1
final_marking = Marking()
final_marking[list(net.places)[[str(p)
for p in net.places].index(fin_place)]] = 1
try:
X, y, class_names = decision_mining.apply(log_patients,
net,
initial_marking,
final_marking,
decision_point=place2analyze)
clf, feature_names, classes = decision_mining.get_decision_tree(log_patients,
net,
initial_marking,
final_marking,
decision_point=place2analyze)
gviz = tree_visualizer.apply(clf, feature_names, classes)
gviz.save(filename='decision_tree_%s' % cond,
directory='../../outputs')
visualizer.view(gviz)
importances = clf.feature_importances_
tree_importances = pd.Series(importances, index=feature_names)
fig, ax = plt.subplots()
tree_importances.plot.bar(ax=ax)
ax.set_title("Feature importances using MDI")
ax.set_ylabel("Mean decrease in impurity")
fig.tight_layout()
fig.savefig('../../outputs/barplot_features_importance_%s.png' % cond)
plt.close(fig)
except ValueError as err:
logging.basicConfig(level=logging.INFO)
logging.info('Error in decidsion making: %s', str(err))
if str(err)!='No objects to concatenate':
UNKNOWN_ERRORTime dependent Cox model
The link between guideline adherence, in terms of performed process measures, and clinical outcomes is a highly demanded issue in diabetes care. A Cox model is a statistical technique for exploring the relationship between the survival of a patient and several explanatory variables. One of the strengths of the Cox model is its ability to encompass covariates that change over time, such as treatment adherence. A time dependent Cox model can be made using each patient trace’s fitness at different time interval.
Joint Latent Class Model
Joint models are used to analyse simultaneously two related phenomena, the evolution of a variable and the occurence of an event. Joint latent class models (JLCM) consist of a linear mixed model and a proportional hazard model linked by the latent classes. The population is split in several groups, the latent classes, and each class is caracterized by a specific evolution of the dependent variable and an associated risk of event. Using fitness as time dependent treatment adherence measure we can made a joint latent class model.
Results
Cohort description
Code
###INCIDENTS
patient_inc <- dbGetQuery(con,
"SELECT *,
CASE WHEN sex = 0 THEN 'Male'
ELSE 'Female' END AS sexo FROM patient_incidents_view")
patient_inc <- patient_inc %>%
left_join( country2cont %>% select(country, CC), by = "country") %>%
mutate(education=factor(education, levels=c(0,1,2,3),
labels=c("Without studies","Primary school",
"High school","University")),
copayment=factor(copayment,levels=c(0,1),
labels=c("less than 18000","more or equal than 18000")))
patient_inc$CC[patient_inc$country == 724] <- 'Spain'
dependent="sexo"
explanatory=c("age","age_dx","CC","copayment","education", "avg_income")
patient_inc%>%summary_factorlist(dependent,
explanatory,
total_col = TRUE,
cont="mean",
cont_cut=7,
na_include = TRUE,
na_to_prop = FALSE,
include_col_totals_percent =FALSE,
add_col_totals = TRUE)-> patient_inc_tab
kable(patient_inc_tab, row.names=FALSE, align=c("l", "l", rep("r",5)))| label | levels | Female | Male | Total |
|---|---|---|---|---|
| Total N | 4778 | 6418 | 11196 | |
| age | Mean (SD) | 61.5 (15.4) | 57.9 (13.1) | 59.4 (14.2) |
| age_dx | Mean (SD) | 64.5 (15.2) | 61.0 (13.0) | 62.5 (14.1) |
| CC | AF | 294 (6.4) | 401 (6.3) | 695 (6.4) |
| AS | 35 (0.8) | 63 (1.0) | 98 (0.9) | |
| EU | 164 (3.6) | 226 (3.6) | 390 (3.6) | |
| OC | 1 (0.0) | 1 (0.0) | 2 (0.0) | |
| SA | 433 (9.4) | 318 (5.0) | 751 (6.9) | |
| Spain | 3665 (79.8) | 5332 (84.1) | 8997 (82.3) | |
| copayment | less than 18000 | 3861 (81.5) | 4558 (71.3) | 8419 (75.7) |
| more or equal than 18000 | 875 (18.5) | 1832 (28.7) | 2707 (24.3) | |
| (Missing) | 42 | 28 | 70 | |
| education | Without studies | 875 (22.6) | 886 (16.6) | 1761 (19.1) |
| Primary school | 990 (25.6) | 1332 (25.0) | 2322 (25.2) | |
| High school | 1664 (43.0) | 2601 (48.8) | 4265 (46.4) | |
| University | 342 (8.8) | 510 (9.6) | 852 (9.3) | |
| (Missing) | 907 | 1089 | 1996 | |
| avg_income | Mean (SD) | 14377.1 (2579.1) | 14359.5 (2490.1) | 14366.8 (2527.2) |
Code
param_inc <- dbGetQuery(con,"SELECT * FROM param_incidents_predx_last_view")
param_cat_inc <- dbGetQuery(con,"SELECT * FROM param_cat_incidents_predx_last_view")
dependent="sexo"
param_inc2unnest <- param_inc %>%
pivot_wider( id_cols=patient_id,names_from=param_name,values_from=param_value)%>%
left_join( patient_inc %>% select(patient_id, sexo), by = "patient_id")
param_cat_inc$param_cat_value = as.numeric(as.character(param_cat_inc$param_cat_value))
param_cat_inc2unnest <- param_cat_inc %>%
pivot_wider( id_cols=patient_id,names_from=param_cat_name,values_from=param_cat_value)%>%
left_join( patient_inc %>% select(patient_id, sexo), by = "patient_id") %>%
mutate(physical_activity=factor(physical_activity, levels=c(0,1,2,3),
labels=c("Incapacity", "Inactive", "Partially Active","Active")),
smoking_status=factor(smoking_status, levels=c(0,1,2,3),
labels=c("Non Smoker", "Ex-smoker < 1 year", "Ex-smoker >= 1 year","Smoker")),
alcohol=factor(alcohol, levels=c(0,1,2),
labels=c("Abstinent", "Moderate Drinker","Heavy Drinker")),
vaccination_covid=factor(vaccination_covid, levels=c(0,1,2,3,4,5,6,7),
labels=c("No", "Yes, BioNTech-Pfizer","Yes, Moderna","Yes, Astrazeneca","Yes, Johnson-Johnson","Yes, Novavax","Yes, Other","Yes, Unknown type")),
vaccination_flu=factor(vaccination_flu, levels=c(0,1), labels=c("No", "Yes")),
working_status=factor(working_status,levels=c(0,1,2),labels=c('Active','Unemployed','Pensionist')))
param_inc_tab <- rbind(
param_inc2unnest %>%
summary_factorlist(dependent ,
c("bmi","height","weight","sbp","dbp","hba1c","col",
"hdl","ldl","trgl","creat","indalbcr","alb"),
total_col = TRUE,cont="mean",na_include = TRUE,
na_to_prop = FALSE,add_row_totals = TRUE,
include_row_missing_col = TRUE,
include_col_totals_percent =FALSE,add_col_totals = TRUE),
param_cat_inc2unnest %>%
summary_factorlist(dependent ,
c("physical_activity","smoking_status","alcohol","vaccination_flu",'working_status'),
total_col = TRUE,cont="mean",na_include = TRUE,
na_to_prop = FALSE,cont_cut=8,add_row_totals = TRUE,
include_row_missing_col = TRUE))
kable(param_inc_tab, row.names=FALSE, align=c("l", "l", rep("r",5)))| label | Total N | Missing N | levels | Female | Male | Total |
|---|---|---|---|---|---|---|
| Total N | 4320 | 5795 | 10115 | |||
| bmi | 6207 (61.4) | 3908 | Mean (SD) | 32.8 (12.6) | 32.1 (9.8) | 32.4 (11.1) |
| height | 6306 (62.3) | 3809 | Mean (SD) | 156.0 (12.7) | 169.4 (15.2) | 163.6 (15.7) |
| weight | 6726 (66.5) | 3389 | Mean (SD) | 79.5 (18.1) | 92.2 (18.6) | 86.8 (19.4) |
| sbp | 8283 (81.9) | 1832 | Mean (SD) | 134.7 (19.9) | 136.7 (19.6) | 135.8 (19.7) |
| dbp | 8275 (81.8) | 1840 | Mean (SD) | 79.1 (11.4) | 81.6 (12.1) | 80.5 (11.9) |
| hba1c | 7573 (74.9) | 2542 | Mean (SD) | 6.9 (1.4) | 7.2 (1.7) | 7.1 (1.6) |
| col | 9414 (93.1) | 701 | Mean (SD) | 204.9 (43.3) | 196.5 (47.9) | 200.1 (46.2) |
| hdl | 9338 (92.3) | 777 | Mean (SD) | 48.1 (12.2) | 41.7 (10.9) | 44.5 (11.9) |
| ldl | 8629 (85.3) | 1486 | Mean (SD) | 124.8 (35.2) | 118.4 (38.5) | 121.3 (37.2) |
| trgl | 9375 (92.7) | 740 | Mean (SD) | 163.6 (114.6) | 200.8 (221.1) | 184.7 (183.7) |
| creat | 9559 (94.5) | 556 | Mean (SD) | 0.8 (0.3) | 1.0 (0.4) | 0.9 (0.3) |
| indalbcr | 7266 (71.8) | 2849 | Mean (SD) | 29.0 (117.7) | 52.1 (286.6) | 42.0 (229.2) |
| alb | 4332 (42.8) | 5783 | Mean (SD) | 26.3 (18.5) | 26.4 (18.6) | 26.3 (18.6) |
| physical_activity | 3659 (36.5) | 6363 | Incapacity | 31 (1.9) | 24 (1.2) | 55 (1.5) |
| Inactive | 265 (16.1) | 228 (11.3) | 493 (13.5) | |||
| Partially Active | 678 (41.3) | 784 (38.9) | 1462 (40.0) | |||
| Active | 669 (40.7) | 980 (48.6) | 1649 (45.1) | |||
| (Missing) | 2456 | 3907 | 6363 | |||
| smoking_status | 4604 (45.9) | 5418 | Non Smoker | 1368 (69.5) | 973 (36.9) | 2341 (50.8) |
| Ex-smoker < 1 year | 46 (2.3) | 136 (5.2) | 182 (4.0) | |||
| Ex-smoker >= 1 year | 240 (12.2) | 807 (30.6) | 1047 (22.7) | |||
| Smoker | 315 (16.0) | 719 (27.3) | 1034 (22.5) | |||
| (Missing) | 2130 | 3288 | 5418 | |||
| alcohol | 4003 (39.9) | 6019 | Abstinent | 1531 (85.9) | 1063 (47.9) | 2594 (64.8) |
| Moderate Drinker | 239 (13.4) | 1043 (47.0) | 1282 (32.0) | |||
| Heavy Drinker | 12 (0.7) | 115 (5.2) | 127 (3.2) | |||
| (Missing) | 2317 | 3702 | 6019 | |||
| vaccination_flu | 3691 (36.8) | 6331 | No | 0 (0.0) | 0 (0.0) | 0 (0.0) |
| Yes | 1799 (100.0) | 1892 (100.0) | 3691 (100.0) | |||
| (Missing) | 2300 | 4031 | 6331 | |||
| working_status | 8716 (87.0) | 1306 | Active | 1238 (37.1) | 2222 (41.3) | 3460 (39.7) |
| Unemployed | 0 (0.0) | 0 (0.0) | 0 (0.0) | |||
| Pensionist | 2101 (62.9) | 3155 (58.7) | 5256 (60.3) | |||
| (Missing) | 760 | 546 | 1306 |
Code
patient_inc$deregistration_date <- patient_inc$deregistration_date %>%
replace_na(as.Date('2023-01-01'))
patient_inc <- patient_inc %>%
mutate(follow_up_time=as.numeric(difftime(deregistration_date,
dx_date,units="days")/365.25)) %>%
filter(follow_up_time!=0)
ss_use_inc <- dbGetQuery(con," SELECT * FROM main.ss_use \
WHERE patient_id IN (SELECT patient_id FROM main.patient \
WHERE dx_date >='2017-01-01') AND \
visit_date>= (SELECT dx_date FROM main.patient WHERE \
main.patient.patient_id = main.ss_use.patient_id)") %>% mutate(
visit_type=factor(visit_type, levels=c(1,2,3,4,5,6,7,8),
labels=c("PC_physician", "PC_nurse", "PC_social_worker",
"PC_emergency_service","PC_others",
"Specialized_visit_physician",
"Specialized_visit_nurse",
"Specialized_visit_unknown_professional")))
ss_use_inc1 <- ss_use_inc %>%
filter(visit_loc == 1)
use_inc_freq1 <- as.data.frame.matrix(table(ss_use_inc1$patient_id,ss_use_inc1$visit_type))
use_inc_freq1 <- cbind(patient_id = rownames(use_inc_freq1), use_inc_freq1)
use_inc_freq1_time <- left_join(use_inc_freq1, patient_inc %>% select(patient_id, sex, follow_up_time,dx_date), by = "patient_id") %>%
mutate(PC_physician= PC_physician/follow_up_time,
PC_nurse=PC_nurse/follow_up_time,
PC_social_worker=PC_social_worker/follow_up_time,
PC_emergency_service=PC_emergency_service/follow_up_time,
PC_others=PC_others/follow_up_time,
Specialized_visit_physician=Specialized_visit_physician/follow_up_time,
Specialized_visit_nurse=Specialized_visit_nurse/follow_up_time,
Specialized_visit_unknown_professional=Specialized_visit_unknown_professional/follow_up_time,
sexo=factor(sex, levels=c(0,1), labels=c("Male", "Female")))
ss_use_inc2 <- ss_use_inc %>%
filter(visit_loc == 2)
use_inc_freq2 <- as.data.frame.matrix(table(ss_use_inc2$patient_id,ss_use_inc2$visit_type))
use_inc_freq2 <- cbind(patient_id = rownames(use_inc_freq2), use_inc_freq2)
use_inc_freq2_time <- left_join(use_inc_freq2, patient_inc %>% select(patient_id, sex, follow_up_time,dx_date), by = "patient_id") %>%
mutate(PC_physician= PC_physician/follow_up_time,
PC_nurse=PC_nurse/follow_up_time,
PC_social_worker=PC_social_worker/follow_up_time,
PC_emergency_service=PC_emergency_service/follow_up_time,
PC_others=PC_others/follow_up_time,
Specialized_visit_physician=Specialized_visit_physician/follow_up_time,
Specialized_visit_nurse=Specialized_visit_nurse/follow_up_time,
Specialized_visit_unknown_professional=Specialized_visit_unknown_professional/follow_up_time,
sexo=factor(sex, levels=c(0,1), labels=c("Male", "Female")))
ss_use_inc3 <- ss_use_inc %>%
filter(visit_loc == 3)
use_inc_freq3 <- as.data.frame.matrix(table(ss_use_inc3$patient_id,ss_use_inc3$visit_type))
use_inc_freq3 <- cbind(patient_id = rownames(use_inc_freq3), use_inc_freq3)
use_inc_freq3_time <- left_join(use_inc_freq3, patient_inc %>% select(patient_id, sex, follow_up_time,dx_date), by = "patient_id") %>%
mutate(PC_physician= PC_physician/follow_up_time,
PC_nurse=PC_nurse/follow_up_time,
PC_social_worker=PC_social_worker/follow_up_time,
PC_emergency_service=PC_emergency_service/follow_up_time,
PC_others=PC_others/follow_up_time,
Specialized_visit_physician=Specialized_visit_physician/follow_up_time,
Specialized_visit_nurse=Specialized_visit_nurse/follow_up_time,
Specialized_visit_unknown_professional=Specialized_visit_unknown_professional/follow_up_time,
sexo=factor(sex, levels=c(0,1), labels=c("Male", "Female")))
use_inc_tab <- rbind(
rbind(data.frame(label = "at care center", levels = '',
Male = '', Female = '', Total = ''),
use_inc_freq1_time %>%
summary_factorlist(dependent ,c("PC_physician", "PC_nurse", "PC_social_worker",
"PC_emergency_service","PC_others",
"Specialized_visit_physician",
"Specialized_visit_nurse",
"Specialized_visit_unknown_professional"),
total_col = TRUE,cont="mean",cont_cut=1,
na_include = TRUE,na_to_prop = FALSE)),
rbind(data.frame(label = "at home", levels = '',
Male = '', Female = '', Total = ''),
use_inc_freq2_time %>%
summary_factorlist(dependent ,c("PC_physician", "PC_nurse", "PC_social_worker",
"PC_emergency_service","PC_others",
"Specialized_visit_physician",
"Specialized_visit_nurse",
"Specialized_visit_unknown_professional"),
total_col = TRUE,cont="mean",cont_cut=1,
na_include = TRUE,na_to_prop = FALSE)),
rbind(data.frame(label = "by telephone or similar", levels = '',
Male = '', Female = '', Total = ''),
use_inc_freq3_time %>%
summary_factorlist(dependent ,c("PC_physician", "PC_nurse", "PC_social_worker",
"PC_emergency_service","PC_others",
"Specialized_visit_physician",
"Specialized_visit_nurse",
"Specialized_visit_unknown_professional"),
total_col = TRUE,cont="mean",cont_cut=1,
na_include = TRUE,na_to_prop = FALSE)))
kable(use_inc_tab, row.names=FALSE, align=c("l", "l", rep("r",5)))| label | levels | Male | Female | Total |
|---|---|---|---|---|
| at care center | ||||
| PC_physician | Mean (SD) | 5.6 (8.2) | 6.1 (8.2) | 5.8 (8.2) |
| PC_nurse | Mean (SD) | 9.4 (12.9) | 8.9 (8.7) | 9.2 (11.3) |
| PC_social_worker | Mean (SD) | 0.2 (1.1) | 0.3 (1.5) | 0.3 (1.3) |
| PC_emergency_service | Mean (SD) | 1.5 (10.5) | 1.6 (5.9) | 1.5 (8.8) |
| PC_others | Mean (SD) | 0.0 (1.2) | 0.0 (0.2) | 0.0 (0.9) |
| Specialized_visit_physician | Mean (SD) | 0.0 (0.0) | 0.0 (0.0) | 0.0 (0.0) |
| Specialized_visit_nurse | Mean (SD) | 0.0 (0.0) | 0.0 (0.0) | 0.0 (0.0) |
| Specialized_visit_unknown_professional | Mean (SD) | 7.8 (24.9) | 6.5 (16.8) | 7.2 (21.8) |
| at home | ||||
| PC_physician | Mean (SD) | 2.9 (8.3) | 3.4 (9.8) | 3.2 (9.1) |
| PC_nurse | Mean (SD) | 5.0 (14.9) | 6.1 (17.5) | 5.6 (16.4) |
| PC_social_worker | Mean (SD) | 0.1 (0.4) | 0.1 (0.5) | 0.1 (0.5) |
| PC_emergency_service | Mean (SD) | 0.2 (1.2) | 0.2 (1.7) | 0.2 (1.5) |
| PC_others | Mean (SD) | 0.0 (0.0) | 0.0 (0.0) | 0.0 (0.0) |
| Specialized_visit_physician | Mean (SD) | 0.0 (0.0) | 0.0 (0.0) | 0.0 (0.0) |
| Specialized_visit_nurse | Mean (SD) | 0.0 (0.0) | 0.0 (0.0) | 0.0 (0.0) |
| Specialized_visit_unknown_professional | Mean (SD) | 0.0 (0.0) | 0.0 (0.0) | 0.0 (0.0) |
| by telephone or similar | ||||
| PC_physician | Mean (SD) | 4.8 (9.4) | 5.5 (8.9) | 5.1 (9.1) |
| PC_nurse | Mean (SD) | 2.9 (6.8) | 3.4 (6.4) | 3.1 (6.6) |
| PC_social_worker | Mean (SD) | 0.3 (2.7) | 0.5 (2.1) | 0.4 (2.4) |
| PC_emergency_service | Mean (SD) | 0.2 (1.9) | 0.3 (1.1) | 0.3 (1.6) |
| PC_others | Mean (SD) | 0.0 (0.4) | 0.0 (0.9) | 0.0 (0.7) |
| Specialized_visit_physician | Mean (SD) | 0.0 (0.0) | 0.0 (0.0) | 0.0 (0.0) |
| Specialized_visit_nurse | Mean (SD) | 0.0 (0.0) | 0.0 (0.0) | 0.0 (0.0) |
| Specialized_visit_unknown_professional | Mean (SD) | 0.1 (1.3) | 0.1 (1.1) | 0.1 (1.2) |
Code
patient_pre <- dbGetQuery(con,
"SELECT *,
CASE WHEN sex = 0 THEN 'Male'
ELSE 'Female' END AS sexo,
FLOOR(DATEDIFF('day',month_nac,DATE '2017-01-01') / 365.25)
AS 'age',
FLOOR(DATEDIFF('day',month_nac, dx_date) / 365.25)
AS 'age_dx'FROM main.patient
WHERE dx_date < '2017-01-01'")
patient_pre <- patient_pre %>%
left_join( country2cont %>% select(country, CC), by = "country") %>%
mutate(education=factor(education, levels=c(0,1,2,3),
labels=c("Without studies","Primary school",
"High school","University")),
copayment=factor(copayment,levels=c(0,1),
labels=c("less than 18000","more or equal than 18000")))
patient_pre$CC[patient_pre$country == 724] <- 'Spain'
dependent="sexo"
explanatory=c("age","age_dx","CC","copayment","education", "avg_income")
patient_pre%>%summary_factorlist(dependent,
explanatory,
total_col = TRUE,
cont="mean",
cont_cut=7,
na_include = TRUE,
na_to_prop = FALSE,
include_col_totals_percent =FALSE,
add_col_totals = TRUE)-> patient_pre_tab
kable(patient_pre_tab, row.names=FALSE, align=c("l", "l", rep("r",5)))| label | levels | Female | Male | Total |
|---|---|---|---|---|
| Total N | 15454 | 20496 | 35950 | |
| age | Mean (SD) | 72.4 (13.2) | 67.7 (12.1) | 69.7 (12.8) |
| age_dx | Mean (SD) | 62.3 (13.7) | 58.3 (12.4) | 60.0 (13.1) |
| CC | AF | 235 (1.5) | 371 (1.8) | 606 (1.7) |
| AS | 39 (0.3) | 56 (0.3) | 95 (0.3) | |
| EU | 248 (1.6) | 356 (1.7) | 604 (1.7) | |
| SA | 440 (2.9) | 315 (1.5) | 755 (2.1) | |
| Spain | 14354 (93.7) | 19314 (94.6) | 33668 (94.2) | |
| OC | 4 (0.0) | 4 (0.0) | ||
| copayment | less than 18000 | 12174 (79.2) | 12372 (60.5) | 24546 (68.5) |
| more or equal than 18000 | 3197 (20.8) | 8065 (39.5) | 11262 (31.5) | |
| (Missing) | 83 | 59 | 142 | |
| education | Without studies | 3323 (30.5) | 3105 (20.5) | 6428 (24.7) |
| Primary school | 3386 (31.1) | 4660 (30.8) | 8046 (30.9) | |
| High school | 3628 (33.3) | 6155 (40.7) | 9783 (37.6) | |
| University | 543 (5.0) | 1198 (7.9) | 1741 (6.7) | |
| (Missing) | 4574 | 5378 | 9952 | |
| avg_income | Mean (SD) | 14371.0 (2500.8) | 14649.2 (2611.4) | 14532.5 (2569.2) |
Code
param_pre <- dbGetQuery(con,"SELECT * FROM param_prevalents_pre_last_view")
param_cat_pre <- dbGetQuery(con,"SELECT * FROM param_cat_prevalents_pre_last_view")
dependent="sexo"
param_pre2unnest <- param_pre %>%
pivot_wider( id_cols=patient_id,names_from=param_name,values_from=param_value)%>%
left_join( patient_pre %>% select(patient_id, sexo), by = "patient_id")
param_cat_pre$param_cat_value = as.numeric(as.character(param_cat_pre$param_cat_value))
param_cat_pre2unnest <- param_cat_pre %>%
pivot_wider( id_cols=patient_id,names_from=param_cat_name,values_from=param_cat_value)%>%
left_join( patient_pre %>% select(patient_id, sexo), by = "patient_id") %>%
mutate(physical_activity=factor(physical_activity, levels=c(0,1,2,3),
labels=c("Incapacity", "Inactive", "Partially Active","Active")),
smoking_status=factor(smoking_status, levels=c(0,1,2,3),
labels=c("Non Smoker", "Ex-smoker < 1 year", "Ex-smoker >= 1 year","Smoker")),
alcohol=factor(alcohol, levels=c(0,1,2),
labels=c("Abstinent", "Moderate Drinker","Heavy Drinker")),
vaccination_covid=factor(vaccination_covid, levels=c(0,1,2,3,4,5,6,7),
labels=c("No", "Yes, BioNTech-Pfizer","Yes, Moderna","Yes, Astrazeneca","Yes, Johnson-Johnson","Yes, Novavax","Yes, Other","Yes, Unknown type")),
vaccination_flu=factor(vaccination_flu, levels=c(0,1), labels=c("No", "Yes")),
working_status=factor(working_status,levels=c(0,1,2),labels=c('Active','Unemployed','Pensionist')))
param_pre_tab <- rbind(
param_pre2unnest %>%
summary_factorlist(dependent ,
c("bmi","height","weight","sbp","dbp","hba1c","col",
"hdl","ldl","trgl","creat","indalbcr","alb"),
total_col = TRUE,cont="mean",na_include = TRUE,
na_to_prop = FALSE,add_row_totals = TRUE,
include_row_missing_col = TRUE,
include_col_totals_percent =FALSE,add_col_totals = TRUE),
param_cat_pre2unnest %>%
summary_factorlist(dependent ,
c("physical_activity","smoking_status","alcohol","vaccination_flu",'working_status'),
total_col = TRUE,cont="mean",na_include = TRUE,
na_to_prop = FALSE,cont_cut=8,add_row_totals = TRUE,
include_row_missing_col = TRUE))
kable(param_pre_tab, row.names=FALSE, align=c("l", "l", rep("r",5)))| label | Total N | Missing N | levels | Female | Male | Total |
|---|---|---|---|---|---|---|
| Total N | 13234 | 16902 | 30136 | |||
| bmi | 17036 (56.5) | 13100 | Mean (SD) | 31.3 (30.7) | 31.4 (97.6) | 31.3 (76.6) |
| height | 17199 (57.1) | 12937 | Mean (SD) | 154.3 (11.5) | 167.6 (14.1) | 161.9 (14.6) |
| weight | 19419 (64.4) | 10717 | Mean (SD) | 79.0 (501.4) | 85.5 (15.5) | 82.7 (328.5) |
| sbp | 25078 (83.2) | 5058 | Mean (SD) | 135.4 (18.0) | 135.0 (16.9) | 135.2 (17.4) |
| dbp | 25043 (83.1) | 5093 | Mean (SD) | 75.5 (10.6) | 76.4 (10.7) | 76.0 (10.7) |
| hba1c | 24086 (79.9) | 6050 | Mean (SD) | 7.0 (1.2) | 7.0 (1.2) | 7.0 (1.2) |
| col | 25567 (84.8) | 4569 | Mean (SD) | 184.4 (38.6) | 171.4 (40.9) | 177.1 (40.4) |
| hdl | 25438 (84.4) | 4698 | Mean (SD) | 48.2 (12.5) | 43.1 (11.4) | 45.3 (12.2) |
| ldl | 23910 (79.3) | 6226 | Mean (SD) | 108.1 (31.9) | 100.6 (31.3) | 104.0 (31.8) |
| trgl | 25474 (84.5) | 4662 | Mean (SD) | 146.9 (85.4) | 147.9 (115.5) | 147.5 (103.4) |
| creat | 24849 (82.5) | 5287 | Mean (SD) | 0.9 (1.5) | 1.1 (2.0) | 1.0 (1.8) |
| indalbcr | 21275 (70.6) | 8861 | Mean (SD) | 52.2 (257.2) | 71.3 (293.3) | 63.2 (278.7) |
| alb | 4346 (14.4) | 25790 | Mean (SD) | 4.1 (0.4) | 4.2 (1.0) | 4.2 (0.7) |
| physical_activity | 9543 (30.7) | 21573 | Incapacity | 82 (1.9) | 49 (0.9) | 131 (1.4) |
| Inactive | 607 (14.4) | 410 (7.7) | 1017 (10.7) | |||
| Partially Active | 1837 (43.6) | 1820 (34.2) | 3657 (38.3) | |||
| Active | 1690 (40.1) | 3048 (57.2) | 4738 (49.6) | |||
| (Missing) | 8811 | 12762 | 21573 | |||
| smoking_status | 12679 (40.7) | 18437 | Non Smoker | 4355 (81.5) | 2970 (40.5) | 7325 (57.8) |
| Ex-smoker < 1 year | 57 (1.1) | 224 (3.1) | 281 (2.2) | |||
| Ex-smoker >= 1 year | 438 (8.2) | 2571 (35.0) | 3009 (23.7) | |||
| Smoker | 492 (9.2) | 1572 (21.4) | 2064 (16.3) | |||
| (Missing) | 7685 | 10752 | 18437 | |||
| alcohol | 12184 (39.2) | 18932 | Abstinent | 4790 (89.9) | 3268 (47.7) | 8058 (66.1) |
| Moderate Drinker | 527 (9.9) | 3337 (48.7) | 3864 (31.7) | |||
| Heavy Drinker | 11 (0.2) | 251 (3.7) | 262 (2.2) | |||
| (Missing) | 7699 | 11233 | 18932 | |||
| vaccination_flu | 18067 (58.1) | 13049 | No | 0 (0.0) | 0 (0.0) | 0 (0.0) |
| Yes | 8229 (100.0) | 9838 (100.0) | 18067 (100.0) | |||
| (Missing) | 4798 | 8251 | 13049 | |||
| working_status | 24421 (78.5) | 6695 | Active | 1446 (15.4) | 2927 (19.4) | 4373 (17.9) |
| Unemployed | 0 (0.0) | 0 (0.0) | 0 (0.0) | |||
| Pensionist | 7922 (84.6) | 12126 (80.6) | 20048 (82.1) | |||
| (Missing) | 3659 | 3036 | 6695 |
Code
patient_pre$deregistration_date <- patient_pre$deregistration_date %>%
replace_na(as.Date('2023-01-01'))
patient_pre <- patient_pre %>%
mutate(follow_up_time=as.numeric(difftime(deregistration_date, as.Date('2016-12-31'),units="days")/365.25))%>%
filter(follow_up_time!=0)
ss_use_pre <- dbGetQuery(con," SELECT * FROM main.ss_use \
WHERE patient_id IN (SELECT patient_id FROM main.patient \
WHERE dx_date <'2017-01-01') AND
visit_date>='2017-01-01'") %>% mutate(
visit_type=factor(visit_type, levels=c(1,2,3,4,5,6,7,8),
labels=c("PC_physician", "PC_nurse", "PC_social_worker",
"PC_emergency_service","PC_others",
"Specialized_visit_physician",
"Specialized_visit_nurse",
"Specialized_visit_unknown_professional"))
)
ss_use_pre1 <- ss_use_pre %>%
filter(visit_loc == 1)
use_pre_freq1 <- as.data.frame.matrix(table(ss_use_pre1$patient_id,ss_use_pre1$visit_type))
use_pre_freq1 <- cbind(patient_id = rownames(use_pre_freq1), use_pre_freq1)
use_pre_freq1_time <- left_join(use_pre_freq1, patient_pre %>% select(patient_id, sex, follow_up_time,dx_date), by = "patient_id") %>%
mutate(PC_physician= PC_physician/follow_up_time,
PC_nurse=PC_nurse/follow_up_time,
PC_social_worker=PC_social_worker/follow_up_time,
PC_emergency_service=PC_emergency_service/follow_up_time,
PC_others=PC_others/follow_up_time,
Specialized_visit_physician=Specialized_visit_physician/follow_up_time,
Specialized_visit_nurse=Specialized_visit_nurse/follow_up_time,
Specialized_visit_unknown_professional=Specialized_visit_unknown_professional/follow_up_time,
sexo=factor(sex, levels=c(0,1), labels=c("Male", "Female")))
ss_use_pre2 <- ss_use_pre %>%
filter(visit_loc == 2)
use_pre_freq2 <- as.data.frame.matrix(table(ss_use_pre2$patient_id,ss_use_pre2$visit_type))
use_pre_freq2 <- cbind(patient_id = rownames(use_pre_freq2), use_pre_freq2)
use_pre_freq2_time <- left_join(use_pre_freq2, patient_pre %>% select(patient_id, sex, follow_up_time,dx_date), by = "patient_id") %>%
mutate(PC_physician= PC_physician/follow_up_time,
PC_nurse=PC_nurse/follow_up_time,
PC_social_worker=PC_social_worker/follow_up_time,
PC_emergency_service=PC_emergency_service/follow_up_time,
PC_others=PC_others/follow_up_time,
Specialized_visit_physician=Specialized_visit_physician/follow_up_time,
Specialized_visit_nurse=Specialized_visit_nurse/follow_up_time,
Specialized_visit_unknown_professional=Specialized_visit_unknown_professional/follow_up_time,
sexo=factor(sex, levels=c(0,1), labels=c("Male", "Female")))
ss_use_pre3 <- ss_use_pre %>%
filter(visit_loc == 3)
use_pre_freq3 <- as.data.frame.matrix(table(ss_use_pre3$patient_id,ss_use_pre3$visit_type))
use_pre_freq3 <- cbind(patient_id = rownames(use_pre_freq3), use_pre_freq3)
use_pre_freq3_time <- left_join(use_pre_freq3, patient_pre %>% select(patient_id, sex, follow_up_time,dx_date), by = "patient_id") %>%
mutate(PC_physician= PC_physician/follow_up_time,
PC_nurse=PC_nurse/follow_up_time,
PC_social_worker=PC_social_worker/follow_up_time,
PC_emergency_service=PC_emergency_service/follow_up_time,
PC_others=PC_others/follow_up_time,
Specialized_visit_physician=Specialized_visit_physician/follow_up_time,
Specialized_visit_nurse=Specialized_visit_nurse/follow_up_time,
Specialized_visit_unknown_professional=Specialized_visit_unknown_professional/follow_up_time,
sexo=factor(sex, levels=c(0,1), labels=c("Male", "Female")))
use_pre_tab <- rbind(
rbind(data.frame(label = "at care center", levels = '',
Male = '', Female = '', Total = ''),
use_pre_freq1_time %>%
summary_factorlist(dependent ,c("PC_physician", "PC_nurse", "PC_social_worker",
"PC_emergency_service","PC_others",
"Specialized_visit_physician",
"Specialized_visit_nurse",
"Specialized_visit_unknown_professional"),
total_col = TRUE,cont="mean",cont_cut=1,
na_include = TRUE,na_to_prop = FALSE)),
rbind(data.frame(label = "at home", levels = '',
Male = '', Female = '', Total = ''),
use_pre_freq2_time %>%
summary_factorlist(dependent ,c("PC_physician", "PC_nurse", "PC_social_worker",
"PC_emergency_service","PC_others",
"Specialized_visit_physician",
"Specialized_visit_nurse",
"Specialized_visit_unknown_professional"),
total_col = TRUE,cont="mean",cont_cut=1,
na_include = TRUE,na_to_prop = FALSE)),
rbind(data.frame(label = "by telephone or similar", levels = '',
Male = '', Female = '', Total = ''),
use_pre_freq3_time %>%
summary_factorlist(dependent ,c("PC_physician", "PC_nurse", "PC_social_worker",
"PC_emergency_service","PC_others",
"Specialized_visit_physician",
"Specialized_visit_nurse",
"Specialized_visit_unknown_professional"),
total_col = TRUE,cont="mean",cont_cut=1,
na_include = TRUE,na_to_prop = FALSE)))
kable(use_pre_tab, row.names=FALSE, align=c("l", "l", rep("r",5)))| label | levels | Male | Female | Total |
|---|---|---|---|---|
| at care center | ||||
| PC_physician | Mean (SD) | 5.1 (5.7) | 5.6 (6.3) | 5.3 (6.0) |
| PC_nurse | Mean (SD) | 8.7 (9.3) | 8.2 (8.1) | 8.5 (8.8) |
| PC_social_worker | Mean (SD) | 0.2 (0.7) | 0.3 (0.8) | 0.3 (0.8) |
| PC_emergency_service | Mean (SD) | 1.6 (7.5) | 1.8 (7.4) | 1.7 (7.5) |
| PC_others | Mean (SD) | 0.0 (0.1) | 0.0 (0.1) | 0.0 (0.1) |
| Specialized_visit_physician | Mean (SD) | 0.0 (0.3) | 0.0 (0.8) | 0.0 (0.6) |
| Specialized_visit_nurse | Mean (SD) | 0.0 (0.0) | 0.0 (0.0) | 0.0 (0.0) |
| Specialized_visit_unknown_professional | Mean (SD) | 9.0 (17.0) | 7.4 (14.2) | 8.3 (15.9) |
| at home | ||||
| PC_physician | Mean (SD) | 2.1 (6.4) | 2.4 (7.8) | 2.3 (7.1) |
| PC_nurse | Mean (SD) | 4.4 (12.3) | 5.5 (13.3) | 5.0 (12.8) |
| PC_social_worker | Mean (SD) | 0.1 (0.4) | 0.1 (0.4) | 0.1 (0.4) |
| PC_emergency_service | Mean (SD) | 0.2 (2.7) | 0.3 (2.8) | 0.3 (2.7) |
| PC_others | Mean (SD) | 0.0 (0.0) | 0.0 (0.0) | 0.0 (0.0) |
| Specialized_visit_physician | Mean (SD) | 0.0 (0.0) | 0.0 (0.0) | 0.0 (0.0) |
| Specialized_visit_nurse | Mean (SD) | 0.0 (0.0) | 0.0 (0.0) | 0.0 (0.0) |
| Specialized_visit_unknown_professional | Mean (SD) | 0.0 (0.0) | 0.0 (0.0) | 0.0 (0.0) |
| by telephone or similar | ||||
| PC_physician | Mean (SD) | 3.8 (5.8) | 4.8 (5.7) | 4.3 (5.7) |
| PC_nurse | Mean (SD) | 3.1 (5.9) | 3.8 (4.8) | 3.4 (5.5) |
| PC_social_worker | Mean (SD) | 0.3 (1.5) | 0.5 (1.4) | 0.4 (1.5) |
| PC_emergency_service | Mean (SD) | 0.2 (1.0) | 0.2 (0.9) | 0.2 (0.9) |
| PC_others | Mean (SD) | 0.0 (0.1) | 0.0 (0.1) | 0.0 (0.1) |
| Specialized_visit_physician | Mean (SD) | 0.0 (0.0) | 0.0 (0.0) | 0.0 (0.0) |
| Specialized_visit_nurse | Mean (SD) | 0.0 (0.0) | 0.0 (0.0) | 0.0 (0.0) |
| Specialized_visit_unknown_professional | Mean (SD) | 0.1 (0.3) | 0.1 (0.3) | 0.1 (0.3) |
Patients without predominant clinical condition
Event log’s creation and description
Choosing patients without any predominant clinical condition and after some preprocessing we obtain an event log that can be represented in the below process map Figure 12. There is shown how the events are connected, the mean number of days patients spent in each event (or treatment), percentage of patients who made each transition and the mean number of days it took to make the transition. However, a spaghetti map is obtained and nothing can be concluded. Therefore, we have to simplify the process map and for example only show the most frequent traces covering 20% of the event log as in Figure 13. Moreover, in Figure 14 we show percentage of patients’ traces each activity is present.
Clustering traces
Once the set of traces to analyze are selected, there is a need to choose a distance measure to clustering. In this example Levenshtein similarity is chosen to calculate the distance matrix.
When distance matrix is acquired, we are able to cluster. However, the number of clusters have to be fixed before. With these figures we are able to conclude what could be the optimal number of clusters.
Choosing the optimal number of clusters, too small clusters can appear, but we can exclude those of less than 30 traces to avoid having clusters of low representation. The process map and the most frequent traces of one of these clusters that remain are the followings.
Conformace checking
Once traces are clusterized, with a boxplot is easy to show that each cluster’s behavior with respect to the treatment guides is different. Comparing traces with Figure 6, calculating the fitness (1 being perfect match with the petri net and 0 being the lowest possible fitness) of each trace and grouping by cluster results in_ Figure 17.
Decision mining
In Figure 18 is shown how patients’ age, sex and copayment would influence in prescribing three drugs based treatment. More features could have been added but for simplicity we included those three.
Variables relevance in the previous decision tree is indicated in Figure 19. As we can see, categorical variables are divided into as many parts as the number of categories there are and variables’ total relevance is the sum of its categories’ values.
Prediction Models
In the next example, we choose fitness, age, sex and copayment to try to predict a hospitalization with a time dependet Cox model, and the summary of it is:
Call:
coxph(formula = Surv(t_0, t_1, status) ~ age + sex + fitness,
data = df)
n= 18857, number of events= 990
coef exp(coef) se(coef) z Pr(>|z|)
age -0.008555 0.991482 0.003088 -2.770 0.0056 **
sex1 0.055093 1.056639 0.065477 0.841 0.4001
fitness -0.405129 0.666891 0.190500 -2.127 0.0334 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
age 0.9915 1.0086 0.9855 0.9975
sex1 1.0566 0.9464 0.9294 1.2013
fitness 0.6669 1.4995 0.4591 0.9687
Concordance= 0.531 (se = 0.01 )
Likelihood ratio test= 14.24 on 3 df, p=0.003
Wald test = 14.59 on 3 df, p=0.002
Score (logrank) test = 14.6 on 3 df, p=0.002
Using same predictive variables a joint latent class model of three classes is made:
Be patient, grid search is running ...
Search completed, performing final estimation
Joint latent class model for quantitative outcome and competing risks
fitted by maximum likelihood method
Jointlcmm(fixed = fitness ~ t_0, mixture = ~t_0, random = ~t_0,
subject = "ID2", ng = 2, survival = Surv(t_min, t_max, status2) ~
age + sex, hazard = "Weibull", hazardtype = "PH", data = df,
verbose = FALSE)
Statistical Model:
Dataset: df
Number of subjects: 2854
Number of observations: 18857
Number of latent classes: 2
Number of parameters: 14
Event 1:
Number of events: 990
Proportional hazards over latent classes and
Weibull baseline risk function
Iteration process:
Convergence criteria satisfied
Number of iterations: 1
Convergence criteria: parameters= 4.8e-18
: likelihood= 2.2e-11
: second derivatives= 2e-08
Goodness-of-fit statistics:
maximum log-likelihood: 17669.74
AIC: -35311.49
BIC: -35228.1
Score test statistic for CI assumption: 1.275 (p-value=0.5285)
Maximum Likelihood Estimates:
Fixed effects in the class-membership model:
(the class of reference is the last class)
coef Se Wald p-value
intercept class1 -1.59186 0.07111 -22.387 0.00000
Parameters in the proportional hazard model:
coef Se Wald p-value
event1 +/-sqrt(Weibull1) 0.02946 0.00286 10.318 0.00000
event1 +/-sqrt(Weibull2) 0.94573 0.01249 75.710 0.00000
event1 SurvPH class1 0.47427 0.10197 4.651 0.00000
age -0.00829 0.00308 -2.689 0.00717
sex1 0.04925 0.06544 0.753 0.45168
Fixed effects in the longitudinal model:
coef Se Wald p-value
intercept class1 0.76526 0.01004 76.193 0.00000
intercept class2 0.91381 0.00338 270.658 0.00000
t_0 class1 -0.00043 0.00001 -32.415 0.00000
t_0 class2 -0.00003 0.00000 -7.818 0.00000
Variance-covariance matrix of the random-effects:
intercept t_0
intercept 0.01839
t_0 -0.00001 0
coef Se
Residual standard error 0.04078 0.00024
Posterior classification based on longitudinal and time-to-event data:
class1 class2
N 405.00 2449.00
% 14.19 85.81
Posterior classification table:
--> mean of posterior probabilities in each class
prob1 prob2
class1 0.8556 0.1444
class2 0.0556 0.9444
Posterior probabilities above a threshold (%):
class1 class2
prob>0.7 79.26 94.69
prob>0.8 66.91 90.20
prob>0.9 55.06 83.26
Posterior classification based only on longitudinal data:
class1 class2
N 386.00 2468.00
% 13.52 86.48
png
2
CV Disease
Event log’s creation and description
Choosing patients with a cardiovascular disease and after some preprocessing we obtain a new event log to make the same analysis it has done to patients without any predominant clinical condition.
Clustering traces
Conformace checking
Decision mining
Prediction Models
In the next example, we choose fitness, age, sex and copayment to try to predict a hospitalization with a time dependet Cox model, and the summary of it is:
Call:
coxph(formula = Surv(t_0, t_1, status) ~ age + sex + fitness,
data = df)
n= 2854, number of events= 261
coef exp(coef) se(coef) z Pr(>|z|)
age -0.000683 0.999317 0.005923 -0.115 0.908
sex1 -0.006853 0.993170 0.138269 -0.050 0.960
fitness -0.080073 0.923049 0.467042 -0.171 0.864
exp(coef) exp(-coef) lower .95 upper .95
age 0.9993 1.001 0.9878 1.011
sex1 0.9932 1.007 0.7574 1.302
fitness 0.9230 1.083 0.3696 2.306
Concordance= 0.513 (se = 0.02 )
Likelihood ratio test= 0.05 on 3 df, p=1
Wald test = 0.05 on 3 df, p=1
Score (logrank) test = 0.05 on 3 df, p=1
Using same predictive variables a joint latent class model of three classes is made:
Be patient, grid search is running ...
Search completed, performing final estimation
Joint latent class model for quantitative outcome and competing risks
fitted by maximum likelihood method
Jointlcmm(fixed = fitness ~ t_0, mixture = ~t_0, random = ~t_0,
subject = "ID2", ng = 2, survival = Surv(t_min, t_max, status2) ~
age + sex, hazard = "Weibull", hazardtype = "PH", data = df,
verbose = FALSE)
Statistical Model:
Dataset: df
Number of subjects: 462
Number of observations: 2854
Number of latent classes: 2
Number of parameters: 14
Event 1:
Number of events: 261
Proportional hazards over latent classes and
Weibull baseline risk function
Iteration process:
Convergence criteria satisfied
Number of iterations: 1
Convergence criteria: parameters= 8.3e-11
: likelihood= 2.3e-05
: second derivatives= 3.2e-06
Goodness-of-fit statistics:
maximum log-likelihood: 1586.47
AIC: -3144.94
BIC: -3087.04
Score test statistic for CI assumption: 4.88 (p-value=0.0872)
Maximum Likelihood Estimates:
Fixed effects in the class-membership model:
(the class of reference is the last class)
coef Se Wald p-value
intercept class1 -0.63657 0.39696 -1.604 0.10880
Parameters in the proportional hazard model:
coef Se Wald p-value
event1 +/-sqrt(Weibull1) 0.02924 0.00618 4.733 0.00000
event1 +/-sqrt(Weibull2) 1.01006 0.03934 25.672 0.00000
event1 SurvPH class1 1.62751 0.34931 4.659 0.00000
age -0.00401 0.00692 -0.579 0.56288
sex1 0.06002 0.16440 0.365 0.71505
Fixed effects in the longitudinal model:
coef Se Wald p-value
intercept class1 0.60431 0.01214 49.795 0.00000
intercept class2 0.58151 0.00866 67.156 0.00000
t_0 class1 -0.00052 0.00007 -7.196 0.00000
t_0 class2 -0.00022 0.00002 -9.623 0.00000
Variance-covariance matrix of the random-effects:
intercept t_0
intercept 0.01213
t_0 0.00000 0
coef Se
Residual standard error 0.04784 0.00075
Posterior classification based on longitudinal and time-to-event data:
class1 class2
N 164.0 298.0
% 35.5 64.5
Posterior classification table:
--> mean of posterior probabilities in each class
prob1 prob2
class1 0.6999 0.3001
class2 0.1512 0.8488
Posterior probabilities above a threshold (%):
class1 class2
prob>0.7 45.12 76.17
prob>0.8 15.85 63.09
prob>0.9 3.66 52.68
Posterior classification based only on longitudinal data:
class1 class2
N 87.00 375.00
% 18.83 81.17
png
2
Heart Failure
Event log’s creation and description
Choosing patients with heart failure and after some preprocessing we obtain a new event log to make the same analysis it has done to patients without any predominant clinical condition.
Clustering traces
Conformace checking
Decision mining


Prediction Models
In the next example, we choose fitness, age, sex and copayment to try to predict a hospitalization with a time dependet Cox model, and the summary of it is:
Call:
coxph(formula = Surv(t_0, t_1, status) ~ age + sex + fitness,
data = df)
n= 1166, number of events= 126
coef exp(coef) se(coef) z Pr(>|z|)
age 0.018126 1.018291 0.008138 2.227 0.0259 *
sex1 -0.227212 0.796752 0.185171 -1.227 0.2198
fitness -1.035612 0.355009 0.890678 -1.163 0.2449
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
age 1.0183 0.982 1.00218 1.035
sex1 0.7968 1.255 0.55425 1.145
fitness 0.3550 2.817 0.06196 2.034
Concordance= 0.568 (se = 0.029 )
Likelihood ratio test= 6.71 on 3 df, p=0.08
Wald test = 6.46 on 3 df, p=0.09
Score (logrank) test = 6.47 on 3 df, p=0.09
Using same predictive variables a joint latent class model of three classes is made:
Be patient, grid search is running ...
Search completed, performing final estimation
Joint latent class model for quantitative outcome and competing risks
fitted by maximum likelihood method
Jointlcmm(fixed = fitness ~ t_0, mixture = ~t_0, random = ~t_0,
subject = "ID2", ng = 2, survival = Surv(t_min, t_max, status2) ~
age + sex, hazard = "Weibull", hazardtype = "PH", data = df,
verbose = FALSE)
Statistical Model:
Dataset: df
Number of subjects: 210
Number of observations: 1166
Number of latent classes: 2
Number of parameters: 14
Event 1:
Number of events: 126
Proportional hazards over latent classes and
Weibull baseline risk function
Iteration process:
Convergence criteria satisfied
Number of iterations: 1
Convergence criteria: parameters= 1e-18
: likelihood= 2.3e-13
: second derivatives= 1.2e-11
Goodness-of-fit statistics:
maximum log-likelihood: 530.95
AIC: -1033.89
BIC: -987.04
Score test statistic for CI assumption: 6.543 (p-value=0.0379)
Maximum Likelihood Estimates:
Fixed effects in the class-membership model:
(the class of reference is the last class)
coef Se Wald p-value
intercept class1 -0.19870 0.28644 -0.694 0.48786
Parameters in the proportional hazard model:
coef Se Wald p-value
event1 +/-sqrt(Weibull1) 0.01661 0.00568 2.924 0.00346
event1 +/-sqrt(Weibull2) 1.00354 0.04370 22.965 0.00000
event1 SurvPH class1 1.65383 0.30592 5.406 0.00000
age 0.01181 0.00907 1.302 0.19287
sex1 0.05210 0.21029 0.248 0.80432
Fixed effects in the longitudinal model:
coef Se Wald p-value
intercept class1 0.62505 0.01381 45.268 0.00000
intercept class2 0.57986 0.01318 43.986 0.00000
t_0 class1 -0.00064 0.00006 -10.823 0.00000
t_0 class2 -0.00025 0.00002 -10.255 0.00000
Variance-covariance matrix of the random-effects:
intercept t_0
intercept 0.0104
t_0 0.0000 0
coef Se
Residual standard error 0.04711 0.00119
Posterior classification based on longitudinal and time-to-event data:
class1 class2
N 101.0 109.0
% 48.1 51.9
Posterior classification table:
--> mean of posterior probabilities in each class
prob1 prob2
class1 0.8045 0.1955
class2 0.1224 0.8776
Posterior probabilities above a threshold (%):
class1 class2
prob>0.7 85.15 78.90
prob>0.8 52.48 73.39
prob>0.9 20.79 62.39
Posterior classification based only on longitudinal data:
class1 class2
N 81.00 129.00
% 38.57 61.43
png
2
Chronic kidney disease
Event log’s creation and description
Choosing patients with chronic kidney disease and after some preprocessing we obtain a new event log to make the same analysis it has done to patients without any predominant clinical condition.
Clustering traces
Conformace checking
Decision mining


Prediction Models
In the next example, we choose fitness, age, sex and copayment to try to predict a hospitalization with a time dependet Cox model, and the summary of it is:
Call:
coxph(formula = Surv(t_0, t_1, status) ~ age + sex + fitness,
data = df)
n= 1131, number of events= 118
coef exp(coef) se(coef) z Pr(>|z|)
age 0.005870 1.005887 0.009487 0.619 0.536
sex1 0.076895 1.079928 0.197359 0.390 0.697
fitness -0.153373 0.857810 0.804843 -0.191 0.849
exp(coef) exp(-coef) lower .95 upper .95
age 1.0059 0.9941 0.9874 1.025
sex1 1.0799 0.9260 0.7335 1.590
fitness 0.8578 1.1658 0.1771 4.154
Concordance= 0.513 (se = 0.033 )
Likelihood ratio test= 0.79 on 3 df, p=0.9
Wald test = 0.78 on 3 df, p=0.9
Score (logrank) test = 0.78 on 3 df, p=0.9
Using same predictive variables a joint latent class model of three classes is made:
Be patient, grid search is running ...
Search completed, performing final estimation
Joint latent class model for quantitative outcome and competing risks
fitted by maximum likelihood method
Jointlcmm(fixed = fitness ~ t_0, mixture = ~t_0, random = ~t_0,
subject = "ID2", ng = 2, survival = Surv(t_min, t_max, status2) ~
age + sex, hazard = "Weibull", hazardtype = "PH", data = df,
verbose = FALSE)
Statistical Model:
Dataset: df
Number of subjects: 242
Number of observations: 1131
Number of latent classes: 2
Number of parameters: 14
Event 1:
Number of events: 118
Proportional hazards over latent classes and
Weibull baseline risk function
Iteration process:
Convergence criteria satisfied
Number of iterations: 1
Convergence criteria: parameters= 4.6e-11
: likelihood= 5.6e-07
: second derivatives= 8.1e-08
Goodness-of-fit statistics:
maximum log-likelihood: 549.78
AIC: -1071.56
BIC: -1022.72
Score test statistic for CI assumption: 0.083 (p-value=0.9593)
Maximum Likelihood Estimates:
Fixed effects in the class-membership model:
(the class of reference is the last class)
coef Se Wald p-value
intercept class1 -0.93966 0.51917 -1.810 0.07031
Parameters in the proportional hazard model:
coef Se Wald p-value
event1 +/-sqrt(Weibull1) 0.02621 0.00943 2.781 0.00542
event1 +/-sqrt(Weibull2) 1.01488 0.04622 21.958 0.00000
event1 SurvPH class1 1.33861 0.36856 3.632 0.00028
age 0.00432 0.01022 0.423 0.67230
sex1 0.03075 0.20970 0.147 0.88341
Fixed effects in the longitudinal model:
coef Se Wald p-value
intercept class1 0.60418 0.01885 32.052 0.00000
intercept class2 0.58567 0.01013 57.818 0.00000
t_0 class1 -0.00064 0.00010 -6.221 0.00000
t_0 class2 -0.00025 0.00003 -8.815 0.00000
Variance-covariance matrix of the random-effects:
intercept t_0
intercept 0.01152
t_0 0.00000 0
coef Se
Residual standard error 0.04398 0.00117
Posterior classification based on longitudinal and time-to-event data:
class1 class2
N 55.00 187.00
% 22.73 77.27
Posterior classification table:
--> mean of posterior probabilities in each class
prob1 prob2
class1 0.6624 0.3376
class2 0.1688 0.8312
Posterior probabilities above a threshold (%):
class1 class2
prob>0.7 34.55 79.68
prob>0.8 16.36 61.50
prob>0.9 7.27 42.78
Posterior classification based only on longitudinal data:
class1 class2
N 33.00 209.00
% 13.64 86.36
png
2
Frailty
Event log’s creation and description
Choosing patients with frailty and after some preprocessing we obtain a new event log to make the same analysis it has done to patients without any predominant clinical condition.
Clustering traces
Conformace checking
Decision mining


Prediction Models
In the next example, we choose fitness, age, sex and copayment to try to predict a hospitalization with a time dependet Cox model, and the summary of it is:
Call:
coxph(formula = Surv(t_0, t_1, status) ~ age + sex + fitness,
data = df)
n= 6535, number of events= 416
coef exp(coef) se(coef) z Pr(>|z|)
age -0.024846 0.975460 0.010359 -2.398 0.0165 *
sex1 -0.005628 0.994388 0.101149 -0.056 0.9556
fitness 0.392983 1.481394 0.229867 1.710 0.0873 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
age 0.9755 1.025 0.9559 0.9955
sex1 0.9944 1.006 0.8156 1.2124
fitness 1.4814 0.675 0.9441 2.3245
Concordance= 0.547 (se = 0.015 )
Likelihood ratio test= 9.86 on 3 df, p=0.02
Wald test = 9.67 on 3 df, p=0.02
Score (logrank) test = 9.69 on 3 df, p=0.02
Using same predictive variables a joint latent class model of three classes is made:
Be patient, grid search is running ...
Search completed, performing final estimation
Joint latent class model for quantitative outcome and competing risks
fitted by maximum likelihood method
Jointlcmm(fixed = fitness ~ t_0, mixture = ~t_0, random = ~t_0,
subject = "ID2", ng = 2, survival = Surv(t_min, t_max, status2) ~
age + sex, hazard = "Weibull", hazardtype = "PH", data = df,
verbose = FALSE)
Statistical Model:
Dataset: df
Number of subjects: 1080
Number of observations: 6535
Number of latent classes: 2
Number of parameters: 14
Event 1:
Number of events: 416
Proportional hazards over latent classes and
Weibull baseline risk function
Iteration process:
Convergence criteria satisfied
Number of iterations: 8
Convergence criteria: parameters= 2.2e-07
: likelihood= 9.6e-06
: second derivatives= 1.9e-09
Goodness-of-fit statistics:
maximum log-likelihood: 4983.84
AIC: -9939.69
BIC: -9869.9
Score test statistic for CI assumption: 12.927 (p-value=0.0016)
Maximum Likelihood Estimates:
Fixed effects in the class-membership model:
(the class of reference is the last class)
coef Se Wald p-value
intercept class1 -0.80123 0.06695 -11.967 0.00000
Parameters in the proportional hazard model:
coef Se Wald p-value
event1 +/-sqrt(Weibull1) 0.07159 0.03099 2.310 0.02088
event1 +/-sqrt(Weibull2) 0.98092 0.01954 50.212 0.00000
event1 SurvPH class1 0.24610 0.10677 2.305 0.02116
age -0.02436 0.01067 -2.283 0.02244
sex1 -0.00839 0.11212 -0.075 0.94035
Fixed effects in the longitudinal model:
coef Se Wald p-value
intercept class1 0.97253 0.00443 219.532 0.00000
intercept class2 0.62371 0.00289 216.048 0.00000
t_0 class1 -0.00007 0.00002 -3.831 0.00013
t_0 class2 -0.00021 0.00001 -18.567 0.00000
Variance-covariance matrix of the random-effects:
intercept t_0
intercept 0.00477
t_0 0.00000 0
coef Se
Residual standard error 0.04363 0.00045
Posterior classification based on longitudinal and time-to-event data:
class1 class2
N 337.0 743.0
% 31.2 68.8
Posterior classification table:
--> mean of posterior probabilities in each class
prob1 prob2
class1 0.9833 0.0167
class2 0.0043 0.9957
Posterior probabilities above a threshold (%):
class1 class2
prob>0.7 97.63 99.73
prob>0.8 97.33 99.46
prob>0.9 97.03 98.79
Posterior classification based only on longitudinal data:
class1 class2
N 336.00 744.00
% 31.11 68.89
png
2
Obesity
Event log’s creation and description
Choosing patients with obesity and after some preprocessing we obtain a new event log to make the same analysis it has done to patients without any predominant clinical condition.
Clustering traces
Conformace checking
Decision mining
Prediction Models
In the next example, we choose fitness, age, sex and copayment to try to predict a hospitalization with a time dependet Cox model, and the summary of it is:
Call:
coxph(formula = Surv(t_0, t_1, status) ~ age + sex + fitness,
data = df)
n= 12121, number of events= 717
coef exp(coef) se(coef) z Pr(>|z|)
age -0.011723 0.988345 0.003468 -3.381 0.000723 ***
sex1 0.064597 1.066729 0.075190 0.859 0.390277
fitness -0.132166 0.876195 0.172409 -0.767 0.443327
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
age 0.9883 1.0118 0.9817 0.9951
sex1 1.0667 0.9374 0.9206 1.2361
fitness 0.8762 1.1413 0.6250 1.2284
Concordance= 0.534 (se = 0.013 )
Likelihood ratio test= 12.47 on 3 df, p=0.006
Wald test = 12.77 on 3 df, p=0.005
Score (logrank) test = 12.78 on 3 df, p=0.005
Using same predictive variables a joint latent class model of three classes is made:
Be patient, grid search is running ...
Search completed, performing final estimation
Joint latent class model for quantitative outcome and competing risks
fitted by maximum likelihood method
Jointlcmm(fixed = fitness ~ t_0, mixture = ~t_0, random = ~t_0,
subject = "ID2", ng = 2, survival = Surv(t_min, t_max, status2) ~
age + sex, hazard = "Weibull", hazardtype = "PH", data = df,
verbose = FALSE)
Statistical Model:
Dataset: df
Number of subjects: 1956
Number of observations: 12121
Number of latent classes: 2
Number of parameters: 14
Event 1:
Number of events: 717
Proportional hazards over latent classes and
Weibull baseline risk function
Iteration process:
Convergence criteria satisfied
Number of iterations: 6
Convergence criteria: parameters= 9.7e-08
: likelihood= 2e-05
: second derivatives= 9.9e-09
Goodness-of-fit statistics:
maximum log-likelihood: 8370.54
AIC: -16713.09
BIC: -16634.98
Score test statistic for CI assumption: 15.455 (p-value=4e-04)
Maximum Likelihood Estimates:
Fixed effects in the class-membership model:
(the class of reference is the last class)
coef Se Wald p-value
intercept class1 0.43719 0.04950 8.832 0.00000
Parameters in the proportional hazard model:
coef Se Wald p-value
event1 +/-sqrt(Weibull1) 0.03611 0.00381 9.480 0.00000
event1 +/-sqrt(Weibull2) 0.95903 0.01477 64.925 0.00000
event1 SurvPH class1 0.00986 0.08130 0.121 0.90344
age -0.01204 0.00348 -3.462 0.00054
sex1 0.07040 0.07527 0.935 0.34964
Fixed effects in the longitudinal model:
coef Se Wald p-value
intercept class1 0.61656 0.00295 209.090 0.00000
intercept class2 0.96074 0.00384 250.358 0.00000
t_0 class1 -0.00016 0.00001 -16.278 0.00000
t_0 class2 -0.00017 0.00001 -13.119 0.00000
Variance-covariance matrix of the random-effects:
intercept t_0
intercept 0.00724
t_0 0.00000 0
coef Se
Residual standard error 0.04800 0.00036
Posterior classification based on longitudinal and time-to-event data:
class1 class2
N 1201.0 755.0
% 61.4 38.6
Posterior classification table:
--> mean of posterior probabilities in each class
prob1 prob2
class1 0.9736 0.0264
class2 0.0254 0.9746
Posterior probabilities above a threshold (%):
class1 class2
prob>0.7 97.67 97.75
prob>0.8 96.34 97.48
prob>0.9 89.76 91.39
Posterior classification based only on longitudinal data:
class1 class2
N 1201.0 755.0
% 61.4 38.6
png
2
Process indicators’ analysis
We selected incident patients with at least one year of follow-up to create a process indicator’s event log. This log includes cholesterol, albumin-creatinine index, glycated hemoglobin, body mass index, blood pressure and glomerular filtration measures, and foot and eye examinations.
An activity presence analysis is done to see the percentage of presence of each indicator in analyzed traces.
Conformace checking
As has been done before, in this section, the observed traces are compared with a specific theoretical process. However, instead of utilizing a single Petri net, five Petri nets are employed, one for each year, as the adherence is now considered in annual intervals from diabetes detection date.
Treatments’ analysis’ and process indicators’ fitness are linked with the next plot:
Prediction Models
In the next example, we choose fitness, age, sex and copayment to try to predict a hospitalization with a time dependent Cox model, and the summary of it is:
Call:
coxph(formula = Surv(t_0, t_1, status) ~ age + sex + fitness,
data = df)
n= 13661, number of events= 2503
coef exp(coef) se(coef) z Pr(>|z|)
age 0.013055 1.013140 0.001561 8.362 < 2e-16 ***
sex1 0.024150 1.024444 0.040948 0.590 0.555343
fitness 0.255482 1.291083 0.075443 3.386 0.000708 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
age 1.013 0.9870 1.0100 1.016
sex1 1.024 0.9761 0.9454 1.110
fitness 1.291 0.7745 1.1136 1.497
Concordance= 0.55 (se = 0.006 )
Likelihood ratio test= 87.23 on 3 df, p=<2e-16
Wald test = 86.42 on 3 df, p=<2e-16
Score (logrank) test = 86.5 on 3 df, p=<2e-16
Using same predictive variables a joint latent class model of three classes is made:
Be patient, grid search is running ...
Search completed, performing final estimation
Joint latent class model for quantitative outcome and competing risks
fitted by maximum likelihood method
Jointlcmm(fixed = fitness ~ t_0, mixture = ~t_0, random = ~t_0,
subject = "ID2", ng = 2, survival = Surv(t_min, t_max, status2) ~
age + sex, hazard = "Weibull", hazardtype = "PH", data = df,
verbose = FALSE)
Statistical Model:
Dataset: df
Number of subjects: 6144
Number of observations: 13661
Number of latent classes: 2
Number of parameters: 14
Event 1:
Number of events: 2503
Proportional hazards over latent classes and
Weibull baseline risk function
Iteration process:
Convergence criteria satisfied
Number of iterations: 1
Convergence criteria: parameters= 1.2e-11
: likelihood= 7.3e-05
: second derivatives= 6.9e-06
Goodness-of-fit statistics:
maximum log-likelihood: -15971.76
AIC: 31971.52
BIC: 32065.65
Score test statistic for CI assumption: 1.365 (p-value=0.5053)
Maximum Likelihood Estimates:
Fixed effects in the class-membership model:
(the class of reference is the last class)
coef Se Wald p-value
intercept class1 1.45062 0.03838 37.794 0.00000
Parameters in the proportional hazard model:
coef Se Wald p-value
event1 +/-sqrt(Weibull1) 0.01496 0.00090 16.654 0.00000
event1 +/-sqrt(Weibull2) 0.95739 0.00809 118.361 0.00000
event1 SurvPH class1 0.17184 0.05716 3.006 0.00264
age 0.01308 0.00156 8.362 0.00000
sex1 0.02767 0.04080 0.678 0.49756
Fixed effects in the longitudinal model:
coef Se Wald p-value
intercept class1 0.79629 0.00265 300.629 0.00000
intercept class2 0.18936 0.00647 29.253 0.00000
t_0 class1 -0.00006 0.00000 -13.953 0.00000
t_0 class2 0.00026 0.00001 26.066 0.00000
Variance-covariance matrix of the random-effects:
intercept t_0
intercept 0.0199
t_0 0.0000 0
coef Se
Residual standard error 0.08374 0.00097
Posterior classification based on longitudinal and time-to-event data:
class1 class2
N 4994.00 1150.00
% 81.28 18.72
Posterior classification table:
--> mean of posterior probabilities in each class
prob1 prob2
class1 0.9798 0.0202
class2 0.0733 0.9267
Posterior probabilities above a threshold (%):
class1 class2
prob>0.7 98.02 89.22
prob>0.8 97.20 84.09
prob>0.9 94.79 78.70
Posterior classification based only on longitudinal data:
class1 class2
N 4990.00 1154.00
% 81.22 18.78
png
2